
Lavinia Ferariu

EDITURA

CONSPRESS

2013

Copyright 2013, Editura Conspress and the author


EDITURA CONSPRESS
is recognized by the
National Council for Scientific Research in Higher Education

Work elaborated within the project: "National network of centers for the
development of study programs with flexible routes and of teaching
instruments for the bachelor and master specializations in the field of
Systems Engineering"

CIP Description of the National Library of Romania (Descrierea CIP a Bibliotecii Naționale a României)

FERARIU, LAVINIA
Lecture Notes for Hybrid Intelligent Systems / Lavinia Ferariu
București: Conspress, 2013
Bibliogr.
ISBN 978-973-100-286-6
004
University textbook

CONSPRESS
B-dul Lacul Tei nr. 124, sector 2,
cod 020396, București
Tel.: (021) 242 2719 / 300; Fax: (021) 242 0781

CONTENTS

Chapter 1. Introduction

Chapter 2. Genetic Algorithms
    Introduction
    Genetic algorithms - overview
    Evolutionary algorithms - an artificial intelligence technique
    Main research directions
    Genetic encoding
    Population initialization
    Genetic operators. Crossover and mutation
    Selection for recombination
    Insertion (selection for survival)
    GA convergence
    Parallel GA
    Benchmarks for GA evaluation
Chapter 3. Artificial Neural Networks
    Artificial neuron
    ANN architectures
    Multi-layer Perceptron (MLP)
    ANN with Radial Basis Functions (RBF)
Chapter 4. Neuro-genetic systems
    Supportive neuro-genetic systems
    Collaborative neuro-genetic systems


CHAPTER 1. INTRODUCTION
Intelligence =
the capacity to improve one's own behavior based on acquired experience (gained by
repeating the same action or similar actions)
[Bäck, 2000].

Intelligence = learning capacity + adaptability.


learning = creation and modification of knowledge representations
+
adaptation = improvement of system performances, as a response to environmental
changes

Human intelligence vs. Machine learning

Model of human cognitive development (Piaget)


Schema = inner model of an activity.
- built via repeated actions;
- improved (assimilation) or extended (accommodation)
as a response to events which trigger unstable states.

Human psychological development


- continuous assimilation and accommodation of the cognitive schemes.

Stages:
sensorimotor - focused on movement control and sensors
o information acquired by the sensors is organized and processed;
o the first cognitive schemas are constructed; they refer mainly to one's own
body/behavior and to neighboring objects;
preoperational
o symbolic thinking;
o capacity of generalization;
concrete operations
o deductive reasoning;
o higher interest in the surrounding environment;
formal operations
o use of abstract concepts,
o specification and verification of working assumptions, etc.

Learning strategies classification


- sorted in ascending order of inference complexity:
learning by heart
o memorizing, without inference.
learning with an instructor:
o the instructor provides the information;
o information is selected, rephrased and integrated with the available
knowledge.
deductive learning:
o new conclusions are deduced from the available knowledge.
learning by analogy:
o available useful knowledge is transformed to tackle a new (similar)
situation.

inductive learning:
 by examples (acquisition of concepts):
look for universal rules describing all positive and negative examples.
 by observations and discovery (unsupervised):
look for universal rules describing the observations
observations are obtained without a supervisor.

Short history:
Beginning: the 1950s.
Main research directions:
automatic proof of theorems, planning and prediction,
automatic programming, human language understanding
=> these set the requirements for building Machine Learning.

Artificial intelligence was successful only for well-delimited problems.

Knowledge representation
by symbols (classic):
a formal set of primitives and rules is employed for symbol handling:
o predicates,
o frames, semantic networks,
o fuzzy systems;

by numbers (sub-symbolic):
o Artificial neural networks,
o Evolutionary algorithms.

CHAPTER 2. GENETIC ALGORITHMS


2.1. Introduction
Short history. Main research areas

First ideas regarding the evolution of species - Charles Darwin (1859).


Darwinian theory: species go through a continuous development process.
Variations can occur during the evolution of any species, and these variations are
transmitted to the offspring.
The best adapted individuals and species have greater chances of survival and development.
Evolution represents a natural selection of inherited variations.

More recently, Neo-Darwinism has explained the mechanisms of inheritance based on the Darwinian theory.

Modern Genetics studies the way in which information is encoded by living organisms.

Evolutionary computation translates the natural selection and evolution theories
into numerical algorithms.

 The natural model is adopted in a simplified version


Evolutionary algorithms work on a population of structures which is
evolved for several generations.

The best adapted structures survive to the next generation and contribute to
the production of new, better adapted offspring.

First trials: the 1950s - Bremermann, Friedberg, Box.

The problem was revisited in 1960-1970: Holland, Rechenberg, Schwefel, Fogel.
The reputation of the approach increased significantly in 1980-1985.
Starting with 1985, several specialized conferences have been organized.
Since 1990, the involved research effort has increased exponentially.

Most common evolutionary algorithms:


Genetic algorithms
General adaptive process applicable to any optimization problem.
A structure contained by the population encodes a point in the space of decision variables.
Holland, De Jong, Goldberg, Davis, Eshelman, Forrest, Grefenstette, Koza, Mitchell, Riolo,
Schaffer.

Evolutionary programming
Goal: the design of finite state automata able to predict the changes occurring in the working
environment.
The environment is described by a string of symbols (according to a finite encoding alphabet).
The algorithm searches for the output symbol providing the fittest prediction.
Fogel, Burgin, Atmar.

Genetic programming
The algorithm searches the fittest program able to solve a certain problem.
Koza.

Evolutionary strategies
Meant to solve optimization problems with continuous parameters.
A structure encodes the values of the decision variables corresponding to a point of the search
space.
Unlike GA: other mechanisms are employed for enriching the genetic material throughout the
generations.
Rechenberg, Schwefel, Herdy, Kursawe, Ostermeier, Rudolph.
Classifier systems
Devoted to the design of classifiers by means of evolutionary techniques.
Holland, Reitman, Booker, De Jong.

Most of the research is targeted at applications:

- wide areas of application;
- good results.

The theoretical background is insufficiently developed.



2.2. Genetic algorithms - overview


Genetic algorithms = a search/optimization method.

o It uses strategies borrowed from Genetics and the Theory of Evolution (natural selection).

o It can approach complex optimizations:


o nonlinear optimizations,
o constrained optimizations,
o multiobjective optimizations.

Problem statement
Let us consider $f : S \subseteq \mathbb{R}^n \to \mathbb{R}$.
The elements $x \in S$ are called decision variables.

Find:
$\arg\min_{x \in S} f(x)$ or $\arg\max_{x \in S} f(x)$.

Terminology: the objective; the objective function $f$; the objective value $f(x)$.

General description of GA

At every iteration (generation), a set (population) of potential solutions (individuals,
chromosomes) $x \in S$ is considered.

The individuals are evaluated in terms of the objective and the best ones are encouraged to
survive and reproduce.

New potential solutions (offspring) are obtained by combining the genetic material of the
parents, similarly to the recombination of DNA chains in biological systems. This process
guides the exploration by using the most valuable genetic material of the current population.

Small variations of the offspring ensure an adequate preservation of population diversity, with
positive impact on avoiding the stagnation in local optima.

The offspring fight for survival with the old solutions. The best adapted solutions will have
greater chances to win this contest.

The process is repeated for an adequate number of generations. If no additional special


mechanisms are employed, the population converges toward a set including duplicates of the
best adapted individual found during exploration.

initialization:
t = 0;
generate N random points uniformly distributed within the search space, to form the initial
population P(t);

repeat while t < No_Generations
step 1: evaluate P(t);
step 2: selection - form the recombination pool with individuals selected from P(t);
step 3: recombination (crossover) - produce the offspring using the parents selected at step 2;
step 4: mutation - apply small variations on the offspring produced at step 3;
step 5: evaluate the offspring obtained at step 4;
step 6: insertion - create P(t+1), by selecting N individuals from the offspring obtained after
step 5 and the samples contained in P(t);
step 7: t = t + 1;
end of the loop

display the best individual of the population;
end of the algorithm
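For concreteness, the loop above can be sketched in Python as follows. This is a minimal illustrative sketch, not the book's reference implementation: it assumes binary encoding, maximization of a non-negative objective, roulette-wheel selection, one-point crossover, bit-flip mutation, and an elitist (N+λ)-style insertion; all names and parameter defaults are hypothetical.

```python
import random

def run_ga(evaluate, l=20, N=50, no_generations=100, pc=0.8, pm=0.02):
    """Minimal GA following the pseudocode above (binary chromosomes of length l)."""
    # initialization: N random points uniformly distributed in S* = {0,1}^l
    pop = [[random.randint(0, 1) for _ in range(l)] for _ in range(N)]
    for t in range(no_generations):
        fit = [evaluate(x) for x in pop]                 # step 1: evaluate P(t)
        total = sum(fit)
        # step 2: selection for recombination (roulette wheel)
        pool = [pop[roulette(fit, total)] for _ in range(N)]
        offspring = []
        for a, b in zip(pool[::2], pool[1::2]):
            c1, c2 = a[:], b[:]
            if random.random() < pc:                     # step 3: one-point crossover
                cut = random.randrange(1, l)
                c1, c2 = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for c in (c1, c2):                           # step 4: bit-flip mutation
                for i in range(l):
                    if random.random() < pm:
                        c[i] = 1 - c[i]
            offspring += [c1, c2]
        # steps 5-6: evaluate the offspring, keep the best N of parents + offspring
        pop = sorted(pop + offspring, key=evaluate, reverse=True)[:N]
    return max(pop, key=evaluate)

def roulette(fit, total):
    """Return the index of the sector hit by the roulette needle."""
    r, acc = random.uniform(0, total), 0.0
    for i, f in enumerate(fit):
        acc += f
        if acc >= r:
            return i
    return len(fit) - 1

# usage: maximize the number of 1s (the "one-max" toy objective)
best = run_ga(evaluate=sum)
```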

Encoding
Most common: binary encoding.

An individual of the population = a chain of characters
(from the employed alphabet);
for binary encoding, the alphabet is {0, 1}.

The encoding maps the exploration space S to S*. The genetic operators act in S*.
The chain (string) used for encoding an individual is called a chromosome.
A position (character) in this chain is called a gene or locus.
The values allowed for a certain gene are called alleles (e.g., for binary encoding, 0 and 1).

The genotype indicates the structure of the chromosome and the values of its genes (it is related to S*).

The phenotype indicates the behavior of an individual obtained due to its specific genotype (it is related
to S).

For the optimization problem $\min_{x \in S} f(x)$:

[Diagram: the phenotype x (space S) corresponds to the genotype, i.e. the chromosome = chain of genes (space S*).]

Genetic operators
- Crossover - it works on two operands: 2 parents → 2 offspring,
by interchanging some sub-chains.

Depending on the number of cutting points:

single cut point crossover

multiple cut point crossover

[Figure: a) single cut point crossover - parents A and B exchange their sub-chains after the cutting point, producing offspring C1 and C2; b) multiple cut point crossover - the parents exchange the sub-chains delimited by several cutting points.]

- Mutation
It works on a single operand: some randomly selected genes are modified.
E.g., for binary encoding, 0 → 1 and 1 → 0:

1 0 .................... 0 1 0 ........... 1 1
becomes (after mutating the second and the last gene)
1 1 .................... 0 1 0 ........... 1 0

Mutation for binary encoding


Remarks:

Genetic operators act according to stochastic rules (their probabilities are smaller than 1):

- not all the pairs of parents formed from the recombination pool are combined by means of
crossover;

- the cutting points and the mutated genes are stochastically selected.

Individual evaluation. Selection for recombination and survival

The adaptation capacity of an individual is assessed in comparison with its competitors from the set
(population).

Usually, the quality of an individual is indicated as follows:

- as an absolute value, by means of the objective function:
  - it indicates how much an individual fits the imposed objective;

- in comparison with the other individuals of the set, by means of the fitness:
  - it encapsulates a comparison between the performances of the individual and the
    performances of its cohabitants;
  - it permits choosing the parents and the survivors.


Generally, an individual better than average is encouraged to survive and to produce offspring, because
it contains genetic material better than the other solutions of the current population.

Potential downsides:

By excessively encouraging the selection of superior individuals, the exploration is guided
toward restricted regions of S* containing the best individuals. The convergence speed is
high; however, the exploration could stagnate around inconvenient solutions.

Even the worst solutions can generate well-fitted individuals by means of successive genetic changes
(performed via genetic operators).

To avoid stagnation in local optima and premature convergence, an adequate
balance between convergence speed and diversity preservation is required. This balance is
mainly tuned by means of parent selection and offspring creation/insertion.

Stop criteria
Because the algorithm works randomly and in an unsupervised manner, it is quite difficult to set
a proper stop condition a priori.

The most common stop test uses a maximum number of generations,
tuned by trial and error.

Another stop criterion verifies whether the differences between the individuals of the current
population have become smaller than a predefined threshold.
If the individuals are still different, the evolutionary loop continues; if they become
too similar, the loop is stopped.
The allowed difference is difficult to set a priori (more difficult than the number of
generations).
The encoding is very important here, as small genotypic differences can involve big phenotypic
differences and vice versa.


The properties of genetic algorithms


When compared with other optimization methods, the main characteristics of GA can
be summarized as follows:

- GA work in parallel on a population of solutions;
- GA use stochastic transition rules;
- GA use the objective values only; no other information is necessary (e.g., the derivatives of the
objective function);
- GA usually encode the set of decision variables (exception: GA based on float encoding).

Iterative optimization methods

Usually the algorithm starts from an initial (known) solution, $x_0$.
Iteratively, $x_k \to x_{k+1}$.
The goal is $\lim_{k \to \infty} x_k = x^*$ (the global optimum).

Zero-order methods use the objective values only:
- usually the objective values are computed in $x_k$ and some of its neighbors;
- examples: simulated annealing, hill climbing, Hooke-Jeeves, tabu search, GA, etc.

First-order methods use the 1st order derivatives of the objective function f:
- assumption: the 1st order derivatives exist;
- example: steepest descent - the algorithm moves in the direction opposite to the gradient.


Second-order methods use the 2nd order derivatives of the objective function:
- assumption: the 2nd order derivatives exist;
- the search direction is opposite to the gradient; the 2nd order derivatives impose
the search step at each iteration.

Steepest descent (gradient) method:

$$x_i^{k+1} = x_i^k - \alpha\,\frac{\partial f}{\partial x_i}(x^k), \quad \text{with } \alpha > 0.$$

Downsides: a differentiable objective function is required;
$x_0$ and $\alpha > 0$ are needed;
high risk of locking into local optima.
Advantages: simplicity, high convergence speed.
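A minimal sketch of this update rule (assuming f is differentiable and its gradient grad_f is supplied by the caller; the step size alpha and the quadratic test function are illustrative):

```python
def steepest_descent(grad_f, x0, alpha=0.01, iters=1000):
    """Iterate x_i^{k+1} = x_i^k - alpha * df/dx_i(x^k): move against the gradient."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)                                   # gradient at the current point
        x = [xi - alpha * gi for xi, gi in zip(x, g)]   # step in the inverse gradient direction
    return x

# usage on f(x) = x1^2 + x2^2, whose gradient is (2*x1, 2*x2);
# starting from (3, -4), the iterates approach the global optimum (0, 0)
print(steepest_descent(lambda x: [2 * x[0], 2 * x[1]], x0=(3.0, -4.0)))
```

Started near a poor basin of a multimodal function, the same iteration stops in a local optimum, which is the downside illustrated by the figure below.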

[Figure: plots of the objective function f(x) and of its derivative versus the decision variable x; depending on the start point, the gradient method can stop in the local points xA, xB, xC, xD, xE instead of the global optimum x*.]

The advantages of genetic algorithms

They use the objective values only → UNIVERSALITY: they can solve ANY optimization problem
(including those with discontinuous, non-differentiable objective functions).
These methods are called weak/soft, because they require scarce a priori information about the
targeted problem.
Available additional information can be integrated within GA in order to improve the
exploration capability and/or the convergence speed (e.g., start from a particular initial
population).
GA are efficient for complex (nonlinear, multiobjective, constrained, multimodal) optimizations:
they can converge toward the GLOBAL optima.
GA are EASY to implement and accept FLEXIBLE configuration.
GA are suitable for PARALLEL implementation.

The drawbacks of genetic algorithms

GA require huge computational resources (time + memory).

The performances are sensitive to several algorithm parameters, such as: the probabilities of crossover
and mutation, the population size, the number of generations, etc.
The random number generator strongly influences the algorithm performances.


2.3. Evolutionary algorithms - an artificial intelligence technique

Evolutionary algorithms involve unsupervised inductive learning based on observations:

The examples (individuals) are created without a supervisor.

The generation of new examples is based on inductive learning - the best available
knowledge is employed.
Good examples are kept in the population; bad ones are eliminated by means of
selection.

Evolutionary algorithms make use of a sub-symbolic chromosomal representation.

2.4. Main research directions

- as outlined by Fogel:
Improved theoretical background: theory explaining the behavior of evolutionary algorithms.
The available results refer to limited cases (incompliant with real applications).
Empirical research: comparative analyses meant to reveal the effect of various techniques/
mechanisms and the influence of the algorithm parameters.
Automatic setting of algorithm parameters: meta-algorithms and adaptive approaches.
Co-evolutionary systems: each individual is an agent which needs to cooperate with the others for
solving the problem; an agent also competes with all the other individuals for survival.
Studies in natural evolution: improved interdisciplinary research for finding valuable ideas, translated later into the numerical approaches.


2.5. Genetic encoding

[Diagram: two basic options - a) change the problem statement (encode the decision variables), so that a standard genetic algorithm can be applied to the modified problem; b) change the GA techniques (a modified genetic algorithm) to cope with the original decision variables.]

Two basic approaches for GA

A. Change the problem statement in compliance with the canonical GA.

The decision variables are encoded with a finite alphabet:

- the standard GA can be applied without any changes;
- the exploration space S is mapped to S*;
- the selections are applied in S;
- the genetic operators are applied in S*;
- the evaluation needs decoding (from S* to S).


Encoding (usually not a bijection!): S → S*; in S*, an individual is a string of genes (each
decision variable has a specific substring).
Decoding (the inverse mapping) is used for interpreting the significance of the genes at the
evaluation stage.

[Diagram: encoding maps $(x_1, \ldots, x_n) \in S$ to the concatenated substrings $v_{11}\ldots v_{1l}, \ldots, v_{n1}\ldots v_{nl} \in S^*$; decoding maps the substrings back to $(x_1, \ldots, x_n) \in S$.]

Genetic steps are carried out in different spaces:

- selection for reproduction and insertion: in S;
- crossover and mutation: in S*.
The population includes points from S*; their images in S are obtained by decoding.

[Diagram: the loop P(t) → decoding and selection in S → recombination pool G(t) → genetic operators in S* → offspring → insertion → P(t+1).]

Key issue - find a proper encoding.

Fogel: the size of the (finite) encoding alphabet has no huge influence (the resulting GAs are
equivalent)
→ use the most intuitive one.

Most popular: GA with binary encoding (canonical genetic algorithms).

Each decision variable is encoded by a substring of 0s and 1s.
For optimization problems involving continuous decision variables:


-

the designer must indicate the length of the chromosome, l:


vu
for xi [u , v] , the encoding step could be set q = l ; the same binary encoding could be
2
used for all the decision variables xi [u + j q, u + ( j + 1) q ) , with j = 0,2l 1 .

[Figure: binary encoding with l bits for a decision variable from [u,v] - successive binary codes correspond to the successive intervals [u, u+q), [u+q, u+2q), [u+2q, u+3q), ..., [v-q, v].]


Example:
Let us consider the encoding of $x_{1,2} \in [-2, 2]$ by means of 4 bits per decision variable:
$$q = \frac{v-u}{2^l} = \frac{2-(-2)}{2^4} = 4/16 = 1/4.$$

o Code 0000 is associated with $x_i \in [-2,\ -2 + 1/4)$; at the evaluation stage, the decoding leads to
$0000 \to x_i = -2 + \frac{1}{4}\cdot\frac{1}{2} = -2 + 1/8$ (the middle of the interval).

o Code 0001 is associated with $x_i \in [-2 + 1/4,\ -2 + 1/2)$; at the evaluation stage, the decoding
leads to $0001 \to x_i = -2 + \frac{1}{4}\cdot\frac{3}{2} = -2 + 3/8$ (the middle of the interval).

o Therefore, the chromosome 0000 0001 (the first substring encodes $x_1$, the second one $x_2$)
is interpreted as $x_1 = -2 + 1/8$ and $x_2 = -2 + 3/8$, and the objective value is computed accordingly.

o If the optimum point is $x_1 = x_2 = -2 + 1/16$, then the best result of the algorithm can be
$x_1 = x_2 = -2 + 2/16$, with an error of 1/16 introduced by the finite-length encoding. The error
can be decreased by using longer chromosomal strings (which lead to a smaller q).
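The decoding used in this example can be sketched as follows (a hypothetical helper, assuming each substring is read as an unsigned integer j and mapped to the middle of its interval):

```python
def decode(chrom, u, v, l):
    """Decode one l-bit substring into the middle of its interval of width
    q = (v - u) / 2**l, as in the example above."""
    q = (v - u) / 2 ** l
    j = int("".join(map(str, chrom)), 2)   # binary string -> integer index j
    return u + j * q + q / 2               # middle of [u + j*q, u + (j+1)*q)

# the 4-bit example with x in [-2, 2]: q = 1/4
assert decode([0, 0, 0, 0], -2, 2, 4) == -2 + 1/8
assert decode([0, 0, 0, 1], -2, 2, 4) == -2 + 3/8
```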

Disadvantages:
o The accuracy of the algorithm depends on the length of the chromosome → very
long chromosomes are required to explore large, highly dimensional search spaces.

o The encoding can increase the complexity of the problem, e.g., the problem
becomes multimodal (it admits multiple global optima).
This can happen whenever the ordering relationship between the distances in S is not
preserved for the distances in S* (a big distance between two individuals in S
does not mean a big distance between the same individuals in S*, and vice versa).
Solution: change the encoding, e.g., use Gray binary encoding.

Remark: for binary encoding, the similitude (in S*) can be analyzed with the Hamming
distance.

B. Modify the GA techniques
Modified genetic operators!
- the decision variables are not encoded: their values are directly memorized in the
chromosomes (S* is not used);
- new genetic operators are needed to work in S (assuming an infinite encoding
alphabet).

Advantages:
- The chromosomal representation is more natural;
- Additional knowledge can be more easily incorporated within the algorithm;
- No extra computational time is needed for decoding.

Float encoding (evolutionary program) - recommended for continuous decision
variables.
Advantages:
- The length of the chromosome = the number of decision variables (independent
of the requested accuracy and exploration range).
- Similar chromosomes correspond to neighboring points.
- The complexity of the optimization problem cannot be changed by the encoding.
Disadvantages:
- New genetic operators are needed.

Remarks:
Which approach is best? There is no general answer:
- B: intermediary results can be easily interpreted;
- A: good theoretical background.
The encoding (A or B) must be joined with compatible genetic operators.

2.6. Population initialization

Usually: randomly generated, according to a uniform distribution in S*.

Binary encoding:
For a population of N chromosomes (each one having l genes), one has to
generate l·N bits (with equal probability for the occurrence of 0 and 1).

Bramlette - extended random initialization:
for each individual, several trials are made and the best sample is used.

Additional knowledge can be used for creating an adequate initial population
(→ increased speed):
- expected localization of the optima;
- expected structural properties of the optimal chromosomes;
- for constrained optimization: the initial population can include feasible solutions
only.

2.7. Genetic operators. Crossover and mutation.

- They ensure the exploration and the creation of new solutions.
- They maintain the diversity of the population.
- Without genetic operators, the best solution of the initial population would be the
result of the algorithm.

Crossover acts on two parents in order to produce two offspring.
It interchanges sub-chains randomly selected from the parents.

Mutation is usually applied after crossover.
It changes some randomly selected genes.


Which operator is more suitable? What is the best probability?
- no general answer is available, no firm winner.
Notations: $p_c$ = crossover probability, $p_m$ = mutation probability.

Different genetic operators have been suggested - no rules are available for choosing
the most suitable one.
Kursawe: the genetic operators must be designed taking into account the
dimension of the search space.
Combine crossover and mutation:

- even for simple problems, the use of a single operator (crossover or
mutation) can lead to unsatisfactory results.
Some recommended values for the crossover and mutation probabilities have been obtained by
means of experimental research.
GA: $p_c \gg p_m$. Evolutionary strategies: use mutation only, or $p_m \gg p_c$.

A. Genetic operators for binary encoding

Crossover (for GA)

Values suggested for $p_c$ (via experimental research):
- some authors recommend $p_c \approx 0.6$;
- other authors recommend $p_c \in (0.75, 0.95)$.
Usually $p_c \gg p_m$.
(For evolutionary strategies: usually $p_c = 0$.)


Types of crossover:
- single cutting point crossover;
- multiple cutting point crossover:
  - more efficient for exploration;
  - random selection of the cutting points + other methods (e.g., avoid interchanging
    identical sub-chains);
- crossover using multiple parents: more than two parents participate in the production
  of an offspring;
- discrete crossover: 50% probability to select a gene from one parent, 50%
  from the other parent.

Most popular: multiple cutting point crossover and discrete crossover.

Mutation
- keeps the diversity of the offspring → avoids the stagnation of the algorithm.
$p_m$ must be correlated with the employed selection.
Usually:
GA: $p_m$ small (rare mutation).
A large $p_m$ can disturb the algorithm convergence.
Example: if all the offspring implicitly survive to the next generation, the use of
$p_m > 1/l$ (l = the length of the chromosome) can lead to instability.
Recommendation - Bäck (1996), for binary encoding: use Gray encoding with $p_m = 1/l$,
with l = the length of the chromosome.


Maintaining a constant $p_m$ throughout the evolutionary loop is not compulsory - e.g., decreasing $p_m$:
- a large $p_m$ in the first generations, in order to refresh the genetic material;
- a small $p_m$ in the last generations, in order to allow the algorithm to converge.

B. Genetic operators for float encoding

The encoding alphabet is infinite.
The length of the chromosome = the number of decision variables.
The most popular operators were proposed by Michalewicz (1996).

Crossover
- simple crossover: interchanges sub-chains randomly selected from the parents;
- heuristic crossover: if parent $x_2$ is better than parent $x_1$,
  $x_1' = a\,(x_2 - x_1) + x_2$, with $a \in (0,1)$ a random scalar;
- discrete crossover: see binary encoding.


- intermediary crossover: parents $x_1$ and $x_2$ produce the offspring $x_1'$ and $x_2'$:
$$x'_{1i} = a_i\,x_{1i} + (1 - a_i)\,x_{2i}$$
$$x'_{2i} = (1 - a_i)\,x_{1i} + a_i\,x_{2i}$$
where $a = [a_i]_i$ is a vector of random values having the same size as the chromosome; its
elements can be chosen from $(-0.25,\ 1.25)$; $x_{ji}$, $j \in \{1,2\}$, indicates the i-th element of the chromosome $x_j$.

[Figure: in the plane (gene 1, gene 2), the offspring are placed in a hypercube slightly larger than the one delimited by the parents $x_1$ and $x_2$.]

- linear crossover: it produces $x_1'$, $x_2'$ from $x_1$, $x_2$:
$$x_1' = a\,x_1 + (1-a)\,x_2, \qquad x_2' = (1-a)\,x_1 + a\,x_2, \qquad (*)$$
with a - a scalar chosen within $(-0.25,\ 1.25)$.

[Figure: the offspring are placed on a segment slightly larger than the one delimited by the parents $x_1$ and $x_2$.]

- simple arithmetic crossover: it changes a single gene according to (*).

Most popular: linear and intermediary crossover.
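A sketch of these two operators, directly following the formulas above (the range (-0.25, 1.25) is the one suggested in the text; the function names are illustrative):

```python
import random

def intermediary_crossover(x1, x2, low=-0.25, high=1.25):
    """One random coefficient a_i per gene; the offspring fall in a hypercube
    slightly larger than the one delimited by the parents."""
    a = [random.uniform(low, high) for _ in x1]
    c1 = [ai * g1 + (1 - ai) * g2 for ai, g1, g2 in zip(a, x1, x2)]
    c2 = [(1 - ai) * g1 + ai * g2 for ai, g1, g2 in zip(a, x1, x2)]
    return c1, c2

def linear_crossover(x1, x2, low=-0.25, high=1.25):
    """A single scalar a for all genes; the offspring fall on a segment
    slightly larger than the one delimited by the parents."""
    a = random.uniform(low, high)
    c1 = [a * g1 + (1 - a) * g2 for g1, g2 in zip(x1, x2)]
    c2 = [(1 - a) * g1 + a * g2 for g1, g2 in zip(x1, x2)]
    return c1, c2
```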

Mutation
- Uniform mutation changes the values of some randomly selected genes.
The chromosome $x = (v_1, v_2, \ldots, v_k, \ldots, v_n)$ is changed to $x' = (v_1, v_2, \ldots, v'_k, \ldots, v_n)$
when the mutation acts on $v_k$.
The new value $v'_k$ is randomly chosen: $v'_k \in (v_k - a,\ v_k + a)$, $a > 0$.

This operator is very useful for populations containing multiple duplicates of the
same individual.

- Non-uniform mutation acts differently at distinct generations:
$$v'_k = \begin{cases} v_k + \Delta(t,\ u - v_k), & \text{for } r = 0 \\ v_k - \Delta(t,\ v_k - l), & \text{for } r = 1 \end{cases}$$
with:
r - a random bit (uniform distribution of 0 and 1);
(l, u) - the range of $v_k$;
t - the current generation;
$\Delta(t, y)$ - decreased at subsequent generations according to:
$$\Delta(t, y) = y\,\big(1 - z^{(1 - t/T)^b}\big), \quad z \in (0,1) \text{ random};\ b \in \mathbb{N} \text{ (usually } b = 5\text{)};$$
T - the maximum number of generations.

This mutation has a larger impact in the first generations, when the genes can
be mutated within larger intervals ($\Delta(t, y)$ is bigger). During the last generations,
only small variations are allowed.
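A sketch of this operator, following the formulas above (the gene index k, its range (low, high) and b = 5 follow the notations in the text; the function name is illustrative):

```python
import random

def nonuniform_mutation(x, k, low, high, t, T, b=5):
    """Non-uniform mutation of gene k: large moves in the early generations,
    vanishing moves as t approaches T."""
    def delta(y):
        z = random.random()
        return y * (1 - z ** ((1 - t / T) ** b))
    if random.random() < 0.5:                 # random bit r
        x[k] = x[k] + delta(high - x[k])      # r = 0: move toward the upper bound
    else:
        x[k] = x[k] - delta(x[k] - low)       # r = 1: move toward the lower bound
    return x
```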


C. Genetic operators for chromosomes with variable length

Vector-based chromosomes with variable length are used in (Goldberg, 1994) for
solving scheduling problems.
Genetic operators recommended for vector-based chromosomes:
- concatenation: two parents are concatenated to form an offspring;
- splitting: it splits a chromosome into two offspring;
- mutation: it acts similarly to the previously described mutations.
Chromosomes of variable length can have more complex structures (e.g., trees,
graphs). In these cases, distinct specialized genetic operators are used.

D. Self-adaptive operators

The technique is imported from evolutionary strategies:
o The chromosome encodes some parameters of the operator.
Very useful for GA.

Mutation
- mutation with adaptive $p_m$ (Smith, Fogarty): $p_m$ is encoded in the chromosome.

Crossover
Usually, the self-adaptive crossovers are aimed at finding suitable cutting points.
- adaptive multiple cutting point crossover, introduced by Schaffer and Morishima: it
uses an adaptive distribution of the cutting points; the cutting points are encoded at the
end of the chromosome. Eliminating an individual also eliminates its encoded cutting points.
- segmented crossover: uses a variable number of cutting points; the chromosome
encodes the probability of cutting the chromosome at a specific locus.
- dual crossover, introduced by Spears: the chromosome encodes an indicator which
specifies the type of crossover to be applied (e.g., discrete or multiple-point).

2.8. Selection for recombination

- It selects the individuals of the current population which participate in the genetic
recombination; these individuals deserve to produce offspring, as their genetic material
is valuable.

Selection is also used for insertion. However, insertion will be analyzed separately, as
its specific mechanisms are sufficiently different.

Attention should be paid to the fact that a GA works on a finite population, for a finite
number of generations.

Selection includes two main stages:

1) compute the selection probabilities - usually equal to the fitness values;
2) choose the parents (sampling), according to the selection probabilities.
- Some authors consider selection = sampling.

[Diagram: Step I - starting from the objective values of all the individuals of the current population, compute the fitness values (selection probabilities); Step II - sample the parents accordingly, filling the recombination pool. Caption: Selection for recombination.]

Stage I. Fitness value computation

The objective function provides an absolute assessment of an individual.
Selection must evaluate each individual relative to its competitors; this evaluation is
provided by the fitness function.

There are two main alternatives for computing the fitness values:
- by explicitly using the objective values;
- by considering the rank assigned to the individual in a list sorted in terms of the
objective values.


1. Fitness assignment by means of an explicit use of the objective values (scaling)

$F(x) = g(f(x))$,
where
F is the fitness function,
f is the objective function,
g describes the mapping from objective values to fitness values.

The fitness values are equal to the selection probabilities:

$$F(x_i) = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)}, \quad \text{with } \sum_{i=1}^{N} F(x_i) = 1 \text{ and } F(x_i) = p_i,$$

where
$x_i$ denotes the individual i of a population including N solutions,
f is the objective function.

Scaling-based fitness assignment encourages the individuals with objective values
better than the average.

Requirement: f must take only positive values.

Usually, a preliminary scaling of the objective function (f → f*) is considered, which aims to:
- meet the requirement above;
- change the influence of the individuals during the evolutionary loop:

$$F(x_i) = \frac{f^*(x_i)}{\sum_{j=1}^{N} f^*(x_j)},$$

where $x_i$ denotes the individual i in a population of size N
and f* is the scaled objective function.

A classification of the most common scaling methods is presented by Goldberg (1994)
and Michalewicz (1996):

- linear scaling provides a linear transformation from f to f*:
$f^*(x_i) = a\,f(x_i) + b$, $a, b \in \mathbb{R}$, where $x_i$ denotes an individual of the population.
The parameters a and b influence the relative quality of the individuals,
therefore they have an impact on the convergence speed and the exploration
capability of the algorithm.
Usually, f* does not significantly alter the average value, whilst increasing the
influence of the individuals better than average.
For maximization problems, a must be positive; for minimization problems, a
must be negative; b ensures that f* is non-negative (Michalewicz, 1996).
a and b are constant during the evolutionary loop.

Example:
If a = 1 and b is significantly bigger than the mean of f,
then the selection based on f* converges more slowly than the one based on f.

Let us consider 10 individuals, mean(f) = 5 (so $\sum_{i=1}^{10} f(x_i) = 50$) and b = 100; in a maximization
problem, for $f(x_1) = 25$ it results:

$$p_1 = \frac{f(x_1)}{\sum_{i=1}^{10} f(x_i)} = \frac{25}{50} = 0.5 \quad \text{(without scaling)}$$

$$p_1 = \frac{f^*(x_1)}{\sum_{i=1}^{10} f^*(x_i)} = \frac{125}{50 + 1000} \approx 0.12 \quad \text{(with scaling)}$$

since $f^*(x_1) = f(x_1) + b = 125$ and $\sum_{i=1}^{10} (f(x_i) + b) = \sum_{i=1}^{10} f(x_i) + 10b = 50 + 1000 = 1050$;

the mean becomes 105.


- power-law scaling:
$f^*(x_i) = (f(x_i))^k$, $k \in \mathbb{R}$, where $x_i$ represents an individual of the population.
The individuals having objective values bigger than 1 gain an increased impact,
whilst those with objective values smaller than 1 are disadvantaged.
k is set slightly bigger than 1 (e.g., 1.005).

Remark:
For some authors, f* is directly used as fitness, as the proportional scaling is implicitly provided
by the stochastic roulette-based sampling.

2. Ranking-based fitness assignment

o Scaling-based fitness assignment gives too much credit to the individuals
considerably better than the average. If a population includes an individual which is
significantly better adapted than the others, this individual will be selected with
many copies within the recombination pool, the offspring will be close to it, so
it has huge chances to conquer the whole population with its
duplicates. This impedes the exploration of larger areas and allows the algorithm
to remain locked in local optima.
This disadvantage is eliminated by ranking-based fitness assignment.

The population is sorted subject to the objective values.

The rank represents the position that an individual has in this sorted list (rank 1 for the
best chromosome, rank N for the worst one).


The fitness values can be computed as follows:

- linear method:
$$F(r_i) = q - (r_i - 1)\,r,$$
where $r_i$ is the rank of the individual $x_i$, and q, r are parameters.
The selection probabilities belong to an arithmetic series (with step r).
To ensure that the sum of all the selection probabilities is equal to 1, it results:
$$q = r\,\frac{N-1}{2} + \frac{1}{N},$$
where N is the size of the population.
When r = 0 (and q = 1/N), all the individuals get the same fitness, no matter what
performances they have.
When $r = \frac{2}{N(N-1)}$ (and q = 2/N), the biggest difference is made between the
individuals placed on consecutive ranks. The worst chromosome is assigned the
fitness value 0 and the best one the fitness value q = 2/N.

It results that the range of q is $(1/N,\ 2/N)$.

Rephrasing suggested by Baker:

$$q = \frac{SP}{N} \quad \text{and} \quad r = \frac{2\,(SP - 1)}{N\,(N-1)},$$

where $SP \in (1, 2)$ is the selection pressure and N is the size of the population.

For SP = 1, all the individuals gain the same fitness.
For SP = 2, the best individual receives the biggest selection probability.

Baker recommends SP = 1.1.

Advantage: a single parameter (SP) is used.

[Figure: selection probability versus rank - a line decreasing from q (rank 1) to q - (N-1)r (rank N).]

Remark:
For some authors, the fitness values are not equal to the selection probabilities,
but only proportional to them. In these cases, the requirement $\sum_{i=1}^{N} F(x_i) = 1$ is not
enforced, being solved at sampling.

- nonlinear method:
$$F(r_i) = q\,(1-q)^{r_i - 1},$$
where $r_i$ denotes the rank of $x_i$ and $q \in (0,1)$.
The selection probabilities belong to a geometric series of ratio $(1-q)$.
For any $q \in (0,1)$, the requirement $\sum_{i=1}^{N} F(x_i) = 1$ cannot be met exactly. However, this sum
is close to 1 for a large N:

$$\sum_{i=1}^{N} F(x_i) = 1 - (1-q)^N < 1; \qquad \sum_{i=1}^{N} F(x_i) \to 1 \text{ for } N \to \infty.$$

Usually, q is chosen close to 0 (e.g., Michalewicz recommends q = 0.04).


The advantages of rank-based fitness assignment:

- it does not need a preliminary scaling of f;
- the selection probabilities can be directly controlled by means of q and r;
- larger ranges of genetic algorithm behaviors can be obtained.

Disadvantages:
- it does not meet the requirements of the schema theory (which analyzes GA
convergence);
- it neglects the differences between the objective values of the individuals with
consecutive ranks;
- it requires setting a priori two parameters (q, r) or one parameter (SP) of huge
impact.

Stage II. Sampling the individuals for the recombination pool

The sampling methods can be analyzed in terms of three indicators introduced by Baker:

o Efficiency - usually related to the computational complexity of the method.

o Bias = the absolute difference between the expected selection probability (usually equal
to the fitness value) and the selection probability considered by the sampling
method.
- The bias defines the accuracy of the sampling.
Desired: bias = 0.

o The spread indicates the range of the number of selections allowed for an
individual: [min_no_samples, max_no_samples].
- It measures the consistency of the method.
- A small spread means that the real number of selected samples is close to
the expected number of samples.
- Please note that a finite number of selection trials is considered. For
infinite trials, the number of occurrences would be compliant with the
selection probabilities; however, for a small number of trials, huge
differences can appear.

Roulette-based method. Stochastic sampling with replacement

[Figure: a roulette wheel of total circumference SUM, on which the sectors are delimited by the cumulative fitness values $F(x_1)$, $F(x_1)+F(x_2)$, $F(x_1)+F(x_2)+F(x_3)$, ..., $F(x_1)+\ldots+F(x_{N-1})$.]

Notations: $x_i$ - the i-th individual of the population, i = 1, ..., N;
$F(x_i)$ - the fitness value of $x_i$.

Explanations:
Each individual gains a sector proportional to its fitness. The population forms a
circle of length SUM (usually SUM = 1, although this requirement is not
mandatory).
Nsel random numbers are generated within (0, SUM), where Nsel indicates the number
of individuals needed in the recombination pool. Each selection corresponds to one
turn of the roulette; the individual sent to the recombination pool is the one indicated
by the position of the needle.
Obviously, a higher fitness value (assigned to a well-adapted individual) leads to a
larger sector and consequently to higher selection chances.
The method ensures:
- null bias;
- large spread, (0, Nsel) - any individual with non-null fitness can be selected;
- computational complexity of order $N_{sel}\,\ln(N_{sel})$.
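A sketch of this sampling scheme (assuming non-negative fitness values; the cumulative sums play the role of the sector boundaries on the wheel):

```python
import bisect
import random
from itertools import accumulate

def stochastic_sampling_with_replacement(fitness, nsel):
    """Roulette-wheel sampling: each of the nsel needle throws lands in a
    sector proportional to F(x_i); returns the selected indices."""
    cumulative = list(accumulate(fitness))    # sector boundaries on the circle
    total = cumulative[-1]                    # SUM
    return [bisect.bisect_left(cumulative, random.uniform(0, total))
            for _ in range(nsel)]
```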

Stochastic sampling with partial replacement

The method is similar to the previous one, but once an individual is selected, its sector is
decreased:
o The decrease corresponds to SUM/Nsel (in fitness), which means reducing
by 1 the number of its future occurrences.
o Starting with the second trial, the needle is no longer allowed to rotate around the
whole circle.
o By successive reductions, a sector can end up with negative length - in this case it is
eliminated from the roulette.
The method ensures:
- null bias;
- smaller spread - the maximum number of samples is $[F(x_i)\,N_{sel}/SUM] + 1$,
where [x] indicates the integer part of x.

Remainder stochastic sampling

These methods include two main stages:
- First stage - deterministic: the selection is provided in terms of the integer part of the
expected number of selections.
- Second stage - stochastic: the rest of the samples are chosen by means of roulette-based sampling.
Two methods are presented below - they employ distinct techniques during the second stage:
o Remainder stochastic sampling with replacement uses stochastic sampling
with replacement for the second stage: the sectors remain unchanged.
The method ensures: null bias, minimum number of samples $[F(x_i)\,N_{sel}/SUM]$.

o Remainder stochastic sampling without replacement uses stochastic sampling
with partial replacement during the second stage: once an individual is selected,
its sector is eliminated.
The method ensures: bias close to 0, very small spread.

Stochastic universal sampling

A single random number p is generated within (0, SUM/Nsel).
The needle positions are then uniformly spaced, corresponding to
$p,\ p + SUM/N_{sel},\ \ldots,\ p + (N_{sel}-1)\,SUM/N_{sel}$.

The method ensures:
o small spread;
o null bias;
o small computational complexity - of order $N_{sel}$.
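A sketch of this method (same assumptions as above; note the single call to the random number generator):

```python
import random
from itertools import accumulate

def stochastic_universal_sampling(fitness, nsel):
    """One random offset, then nsel equally spaced pointers: null bias,
    small spread, O(nsel) complexity."""
    cumulative = list(accumulate(fitness))
    step = cumulative[-1] / nsel              # SUM / Nsel
    pointer = random.uniform(0, step)         # the single random number p
    selected, i = [], 0
    for _ in range(nsel):
        while cumulative[i] < pointer:        # advance to the sector hit by the pointer
            i += 1
        selected.append(i)
        pointer += step                       # next equally spaced pointer
    return selected
```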


Other selection methods:

K-tournament selection
A set of k individuals is randomly formed; the best of them is placed in the
recombination pool.
For Nsel samplings, the above procedure is repeated Nsel times.
A large k means a large selection pressure.
Usually k = 2.
Additionally, a Boltzmann mechanism can be used (Michalewicz, 1996):
If x competes with v and x is better than v, the selection of v is sometimes
also allowed:
- generate a random number between 0 and 1;
- if it is smaller than $e^{\frac{F(v) - F(x)}{T}}$, v wins the contest; otherwise x is the winner;
- T is a float value decreased during the evolutionary loop;
- F(v) and F(x) are the fitness values.
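A sketch of k-tournament selection with the optional Boltzmann rule described above (assuming k ≥ 2 when a temperature is given; the names are illustrative):

```python
import math
import random

def k_tournament(fitness, nsel, k=2, temperature=None):
    """Pick the best of k random competitors, nsel times. With a temperature,
    the Boltzmann rule occasionally lets a weaker competitor v win instead."""
    selected = []
    for _ in range(nsel):
        group = random.sample(range(len(fitness)), k)
        winner = max(group, key=lambda i: fitness[i])
        if temperature is not None and k > 1:
            v = random.choice([i for i in group if i != winner])
            # v wins if a uniform draw falls below e^((F(v) - F(winner)) / T)
            if random.random() < math.exp((fitness[v] - fitness[winner]) / temperature):
                winner = v
        selected.append(winner)
    return selected
```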

Takeover time

Let us consider a GA without genetic operators, which uses the selection for
recombination in order to directly form the population of the next generation. The size
of the population is kept constant.

The solutions better than the average of the initial population are progressively
sampled with more and more copies.
After a finite number of generations, the best individual conquers the whole
population with its duplicates and the algorithm stops.
This number of generations is called the takeover time.


Bäck illustrates two types of selection in relation to the takeover time:

Soft selection, based on a small selection pressure → large takeover time:

o The best solutions are not extremely encouraged.
o The algorithm ensures a very good exploration; the population stays diverse
for a large number of generations (Kuo and Hwang, 1996).

Hard selection, based on a big selection pressure → small takeover time:

o The best solutions are extremely encouraged.
o The population loses its diversity in the preliminary generations, so the
exploration is focused on certain small areas.
o The convergence speed is very high.

Bäck (1996) compared the scaling-based selection with the rank-based selection
(sampling is solved with a roulette-based method):
- Scaling-based selections with linear, polynomial or power-law scaling lead
to a takeover time of order $N \ln(N)$, where N is the size of the population.
The type of scaling has no huge influence.

- For rank-based selection, the takeover time is very sensitive to SP.
The takeover time is monotonically decreasing in terms of SP
(a small SP means soft selection).

Bäck recommends:
rank-based selection and k-tournament selection.


Classification of selections - Bäck and Hoffmeister; Michalewicz (1996).

Considering the dynamics of the selection probabilities:
- static selections - the same set of selection probabilities is used at every generation;
- dynamic selections - the set of selection probabilities changes from one generation to
another.

Considering the minimum selection probability:

- non-conservative selections - accept individuals with null selection probability:

  - with left elimination - no chance to select the best solutions (the goal is to
    avoid premature convergence);
  - with right elimination - no chance to select the worst solutions;
- conservative selections - null selection probabilities are not accepted.

2.9. Insertion (selection for survival)

Offspring insertion = a selection process:
one selects the surviving offspring and old solutions.

A. Insertion for fixed-length populations

- The size of the population (N) is maintained constant during the evolutionary loop.

o The selection which ensures the survival of the best solution at every generation
is called elitist.
- For this type of selection, Rudolph proved the convergence towards the
optimal point.


A.1. Methods specific to GA
- In order to reduce memory and computational time consumption, GA produce
fewer offspring than the population size.
- Some offspring (the best ones) are inserted into the population.
- Size of the recombination pool / size of the population = the generation gap;
it indicates the informational gain of the algorithm per generation.
The new information is achieved by exploiting the most valuable genetic material
inherited during the evolutionary loop.
- λ offspring are inserted at each generation, λ = constant.

o If λ ∈ {1, 2}, the selection is called steady-state.

o If N offspring replace all N current solutions, the selection is called pure
- each individual lives for a single generation.

Usually, the offspring replace the worst solutions of the current population.

The replacement can be deterministic or stochastic (roulette-based), using inverse-fitness selection:
- Inverse-fitness selection means that the worst solutions of the current
population are replaced; this is computationally advantageous, because λ < N.
- Fogarty proved that deterministic and stochastic insertions lead to similar
convergence speeds. Replacing the worst individuals means an elitist
selection.
- Therefore, for the sake of simplicity and better time performances, the
insertion is usually deterministic.


Other insertions consist in the replacement of the oldest solutions of the current
population:
- As the best individuals have numerous duplicates created at successive
generations, a well adapted individual can survive via its newest copies.

Insertion can act:

- once per generation (generational insertion) - after all the offspring are generated;
- on the fly - an offspring is introduced into the population immediately after its
generation; this means that an offspring can replace another offspring obtained at
the same generation.

Other insertions are based on similitude analysis:

Crowding insertion introduces an offspring by eliminating the most similar individual
of the current population. This insertion is useful for multimodal optimizations.
o preselection (proposed by Cavicchio): an offspring replaces the most similar parent:
  - multiple species can coexist within the same population;
  - increased diversity.
o overcrowding (proposed by De Jong): an offspring replaces a similar solution of the current
population.
A set of k individuals is randomly selected from the current population; the most similar one is
replaced by the offspring.

The similitude between two solutions is measured by means of the Euclidean distance
(float encoding) or the Hamming distance (binary encoding).


All the above described insertions are of type (λ, N), with λ ≤ N:

- This means that at every generation λ offspring are inserted in the population by
eliminating λ parents.
- N − λ individuals of the current population survive to the next generation.
- For λ = N, all the offspring are inserted and no old solution survives.

A.2. Insertions imported from evolutionary strategies

These insertions were first used in evolutionary strategies and then imported into
genetic algorithms.
Types:
(λ, N),
(N+λ).

Remark:
Schwefel recommends the (λ, N) insertion with λ >> N; however, the (N+λ) insertion
has also proved its efficiency in numerous applications.


A.2.1. (λ, N) insertion with λ >> N

A huge number of offspring is produced at each generation.
The next population is formed with the best N offspring.

o Usually, this insertion is deterministic.
o However, this insertion is not elitist, as the best offspring can be worse than the
best current solution.
o The insertion is static and non-conservative (some eliminations are accepted).
o This insertion is very useful for the optimization of time-variant, noisy
functions.
o Bäck proved that this selection ensures higher selection pressures than k-tournament or ranking selections.

A.2.2. (N+λ) insertion

λ offspring are generated. For evolutionary strategies λ >> N, although the method can
also be applied for λ ≤ N.
An intermediary population (of size N+λ) is formed by reuniting the old
solutions and the offspring. Then, its best N individuals deterministically survive to the
next generation (a minimal sketch is given below).

o The insertion is elitist - the performances of the best solution are monotonically
improved.
o The survivors can also be stochastically selected from the intermediary
population, using k-tournament selection. Usually, each selected solution is
extracted (eliminated from the population).
Small values of k are preferred.
For a large k, the selection is close to the deterministic one.
o The method ensures a high convergence speed. It can be used in combination
with techniques able to preserve high diversity within the population.
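A sketch of the deterministic variant (assuming maximization; the names are illustrative):

```python
def plus_insertion(population, offspring, fitness, n):
    """(N+lambda) insertion: reunite the parents and the offspring, keep the
    best N. Elitist - the best solution found so far can never be lost."""
    merged = population + offspring
    merged.sort(key=fitness, reverse=True)    # best first (maximization)
    return merged[:n]
```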

B. Insertion for algorithms evolving a variable-sized population

The size of the population influences the accuracy and the convergence speed of the
algorithm:

- Algorithms working on small populations do not ensure an intense
exploration of S and can lock into non-optimal points, although their
convergence speed can be good.
- Algorithms working on large populations have an explorative behavior, at
the cost of a high computational resource consumption.

These problems can be addressed by means of variable-sized populations.

When created, each individual is assigned a lifetime:

- At every generation the lifetime is decremented; when the lifetime becomes 0, the
individual is eliminated from the population (it dies).

o Insertion is implicitly solved.
o Should all the individuals be assigned lifetimes bigger than 1, the size
of the population increases exponentially.

The lifetime should be assigned by taking into account the performances of each individual
relative to the performances of the individuals included in the current population and/or
in the previous populations.

Better individuals should live for longer time intervals, thus having higher
chances to produce offspring inheriting their genetic material.


E.g., linear allocation:

$$\text{lifetime}(x_i) = m + (M - m)\,\frac{F(x_i) - F_{\min}^{abs}}{F_{\max}^{abs} - F_{\min}^{abs}},$$

where
m and M represent the minimum and the maximum lifetimes, respectively,
$F(x_i)$ denotes the fitness value of the individual $x_i$,
and $F_{\min}^{abs}$ and $F_{\max}^{abs}$ indicate the minimum and the maximum fitness values obtained from the beginning of
the evolutionary loop.
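A sketch of this allocation (m = 1 and M = 11 are illustrative choices, not taken from the text):

```python
def linear_lifetime(fitness_i, f_min_abs, f_max_abs, m=1, M=11):
    """Linear lifetime allocation: better individuals live longer, between
    the minimum lifetime m and the maximum lifetime M."""
    if f_max_abs == f_min_abs:                 # degenerate population: all equal
        return (m + M) // 2
    share = (fitness_i - f_min_abs) / (f_max_abs - f_min_abs)
    return round(m + (M - m) * share)
```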

Remark:
o It can be useful to use larger populations at the beginning of the algorithm (in
order to encourage exploration) and smaller populations during the last
generations (when the exploration is merely guided around some solutions).

2.10. GA convergence
The theory of GA is not yet capable of entirely explaining the involved mechanisms:
GAs are very good, but we do not exactly know why they are so good.

Optimization theory states that an algorithm converges toward the
global optimum if it generates a sequence of solutions having the
global optimum as its limit.

o The convergence has been proved for particular GAs, under unrealistic assumptions,
such as infinite populations or an infinite number of generations.


2.10.1. Schema theory

Assumptions: chromosomes of size l; a finite encoding alphabet of size k.
!!! For the sake of simplicity, binary encoding is considered, meaning that the
encoding alphabet is {0, 1} (size k = 2).

o Each gene/locus illustrates a distinct feature of the individual. The evolutionary
process determines the features of the best adapted solutions (e.g., almost all well-adapted
individuals have the first gene 0 and the last gene 1).

o This implicitly indicates a favorable search direction within the phenotypic
space and a schema (building block) in the genotypic space.

Schema = a structural template describing the genotypic similarities of the individuals.

o Each schema contains constant and variable genes. For the previous example, the
schema is 0###.....##1, where # indicates the variable genes, for which any allele is
permitted (in this case 0 or 1).

Therefore, the search can be viewed as the process which looks for the best adapted
schemata.
Holland stated that the fitness value of an individual gives partial
information about the adaptation capacity of the schemata instantiated by
the individual.
Rephrasing, the fitness of a schema H can be computed as the mean fitness
of the individuals containing instances of H.


A schema is characterized by two parameters:

The order, $\omega(H)$ = the number of constant genes
- e.g., the schema 01**10*1 has the order 5.

The (defining) length, $\delta(H)$ = the length of the chain delimited by the first and the last
constant genes, minus 1
- e.g., the schema 01**10*1 has the length 8 − 1 = 7.
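These two parameters (and the notion of an instance) can be sketched as follows, using '*' for the variable genes as in the example above:

```python
def order(schema):
    """omega(H): the number of constant (non-wildcard) genes."""
    return sum(1 for g in schema if g != '*')

def defining_length(schema):
    """delta(H): the distance between the first and the last constant genes."""
    fixed = [i for i, g in enumerate(schema) if g != '*']
    return fixed[-1] - fixed[0]

def matches(schema, chromosome):
    """True if the chromosome is an instance of the schema."""
    return all(g == '*' or g == c for g, c in zip(schema, chromosome))

assert order('01**10*1') == 5 and defining_length('01**10*1') == 7
assert matches('01**10*1', '01011011')
```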

Schema Theorem: a GA with linear scaling-based selection, simple crossover and rare
mutation encourages the multiplication of the schemata better adapted than the average, having
small lengths and small orders:

$$m(H, t+1) \ge m(H, t)\,\frac{f(H)}{\frac{1}{N}\sum_{i=1}^{N} f(x_i)}\,\Big[1 - p_c\,\frac{\delta(H)}{l-1} - \omega(H)\,p_m\Big],$$

with:
$m(H, t)$ - the number of instances of H at generation t;
$m(H, t+1)$ - the number of instances of H at generation t+1;
N - the size of the population;
$\omega(H)$ - the order of H;
$\delta(H)$ - the length of H;
$f(H)$ - the fitness of H, computed as the average fitness of all the individuals contained by
P(t) which comprise instances of H;
$\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ - the mean fitness of the individuals belonging to P(t);
$p_c$ and $p_m$ - the crossover and mutation probabilities.


Proof
o Let us consider the scaling-based selection applied on P(t), in order to fill the
recombination pool with N samples.

o The expected number of selected samples for the individual $x_i$ having the fitness
$f(x_i)$ is:
$$n_i = \frac{f(x_i)}{\frac{1}{N}\sum_{j=1}^{N} f(x_j)}.$$

o After selection, the number of instances of H within the recombination pool is:
$$m(H, t+1)_s = m(H, t)\,\frac{f(H)}{\frac{1}{N}\sum_{i=1}^{N} f(x_i)},$$
with
$f(H)$ - the fitness of H computed for P(t),
$\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ - the mean fitness of P(t).
This equation indicates that the GA encourages the multiplication of the schemata which are
better adapted than the average.

o Afterwards, crossovers and mutations are applied with the probabilities $p_c$ and $p_m$,
respectively. The resulting offspring are all inserted in the population of the
next generation.

o The probability of destroying an instance of H (contained by a parent) by means
of crossover is $\frac{\delta(H)}{l-1}$, so the survival probability of the instance results:
$$p_{sc} \ge 1 - p_c\,\frac{\delta(H)}{l-1}.$$
Crossover encourages the survival of the shortest schemata.

o Mutation can also destroy some instances of H.
The survival probability for the instances of H is:
$$p_{sm} = (1 - p_m)^{\omega(H)} \approx 1 - p_m\,\omega(H).$$
Mutation encourages the schemata with lower orders.


o Considering an encoding alphabet of size k and chromosomes of length l,
$(1 + k)^l$ different schemata can be produced.
Proof: a schema represents a string of size l, formed with any character of the
encoding alphabet or the character #.

o A chromosome of size l instantiates $2^l$ schemata, because each gene can be
interpreted as a constant value or as #. Therefore, a chromosomal chain of size l
ensures the existence of $2^l$ schemata.

o For a population of N chromosomes (each one having l genes), the number of
instantiated schemata is between $2^l$ and $N \cdot 2^l$, because distinct
chromosomes can also contain some common schemata.
o For a finite population, some schemata can have no instances - the best ones tend
to conquer the population.

o Even a small population contains rich information concerning the similarities of
the individuals.

o Holland proved that the number of the schemata which are efficiently processed
by the GA at a certain generation is about $N^3$, where N denotes the size of the
population.
- Bertoni and Dorigo argued that Holland's estimator is valid for populations
having a size proportional to $2^l$.

o However, a GA can implicitly analyze significantly more schemata than the number
of its individuals; this behavior is called implicit parallelism.


2.10.2. Banach's theorem regarding GA convergence

o Using Banach's fixed-point theorem, a useful result concerning GA convergence has been
obtained:
A GA which is capable of improving the mean performances of its
population at any successive generations converges towards a fixed
population (fixed point).
Therefore, for any initial population, after an infinite number of
generations, a final fixed population is obtained; this population includes
only optimal solutions.

Remarks:
o The theorem does not give any result concerning the convergence speed of the
algorithm.
Obviously, the convergence speed is influenced by the algorithm parameters
(the size of the population, the genetic operator probabilities, etc.) and by the
content of the initial population.
In real implementations, the number of generations is finite, too.

o The theorem requires the improvement of the mean performances of the
population. Therefore, we can validate the GA going to the next generation only
if this requirement is met. This can also involve reiterating the evolutionary
process at certain generations, until a better population is found.
2.9.3. Other results concerning GA convergence

Rudolph proved that a GA which performs the scaling-based selection of N parents,
and uses crossover and mutation, does not necessarily converge toward the global
optimum.
  Even if the expected number of optimal solutions tends towards values greater than
  1, the convergence towards the global optimum is not assured. The explanation is
  related to the fact that the probability of losing these points is not zero.
  Therefore, the Schema theorem does not guarantee the convergence toward the
  global optimum.

Rudolph also proved that a GA with elitist selection (which keeps the best solution within
the population) converges towards the global optimum.
  The requirement does not refer to the improvement of the mean performances of the
  population, yet to the survival of the best adapted individual, only.

Insightful explanations concerning the influence of selection and of the genetic operators were
delivered by Qi and Palmieri. Let us consider a GA working on infinite populations for
optimizing bounded, positive, unimodal objective functions with a finite number of
discontinuities.

o If the initial population covers (continuously) the whole exploration space, the
  scaling-based selection will encourage the clustering of the individuals towards the
  regions characterized by the highest fitness values. The density of the solutions is
  increased around the optimum point.
o So, the use of selection without genetic operators guarantees the convergence
  towards the global optimum.
o This convergence is also proved for GAs working on infinite populations (which
  continuously cover the search space) with scaling-based selection and mutation of
  low magnitude or rare occurrence.
When working with finite populations, the initial population does not include all
the potential solutions of the exploration space, so the action of the genetic operators
is crucial for refreshing the genetic material.

Also note that GAs involve a finite number of generations, so the convergence
speed is vital for the algorithm performances.
  This convergence speed depends on all the algorithm parameters.

2.10. Parallel GA
o Because GAs are time consuming, they are usually employed for offline
  applications. The execution time depends on the size of the population, the
  selection pressure, etc.

o Using smaller populations can lead to smaller execution times, at the cost of
  reduced accuracy.

o A more valuable approach for reducing the execution time without altering the
  other algorithm performances is to consider parallel implementations.

Three main directions can be distinguished: global GAs, migration-based GAs and
diffused GAs.
2.10.1. Global GAs

These approaches exploit the fact that some GA stages can be carried out in parallel on
different individuals or pairs of individuals.

  MASTER
    |--- SLAVE 1
    |--- ...
    |--- SLAVE k

  Master-slave architecture for global GAs

Example of master-slave architecture:

o master - for population initialization, fitness computation, selection and the general
  control of the population;
o slaves - for crossover, mutation, offspring evaluation.

Other parallel implementations can be considered
  - e.g. using systolic approaches.
2.10.2. Migration-based GAs (GAs with migration)

The population is divided into several subpopulations of equal sizes, which evolve
independently for a certain number of generations.
Periodically, some individuals are exchanged between the subpopulations.

One must indicate:
- when migration is allowed;
- the ratio of individuals which migrate (r% individuals of the subpopulation);
- which individuals migrate;
- which subpopulations interchange information.

initialization:
t=0; choose N random individuals for each subpopulation SbP(t);
repeat while t < No_Generations
  for each subpopulation SbP(t) execute separately:
    step 1: evaluate SbP(t);
    step 2: selection - fill the recombination pool of the subpopulation;
    step 3: crossover - generate offspring using the parents selected at step 2;
    step 4: mutation - apply small variations on the individuals obtained at step 3;
    step 5: evaluate the offspring resulted at step 4;
    step 6: insertion - create SbP(t+1), choosing N individuals from SbP(t) and from the
            offspring obtained at step 5;
  if migration is allowed:
    step 1: choose r% individuals from each subpopulation (the best ones) - for migration;
    step 2: establish the content of the subpopulations for the next generation, eliminating
            the less adapted host individuals;
  t=t+1;
end of the loop
display the best individual of the entire population;
end of the algorithm
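A compact Python sketch of this loop is given below; the problem-specific functions
fitness, select, crossover, mutate and insert are assumed to be defined elsewhere, and
ring migration is used:

def island_ga(subpops, no_generations, migration_period, r):
    for t in range(no_generations):
        for sp in subpops:                        # independent evolution
            pool = select(sp)                     # selection for recombination
            offspring = [mutate(crossover(a, b))
                         for a, b in zip(pool[::2], pool[1::2])]
            sp[:] = insert(sp, offspring)         # selection for survival
        if (t + 1) % migration_period == 0:       # migration is allowed
            k = max(1, int(r * len(subpops[0])))  # r% emigrants (the best ones)
            for i, sp in enumerate(subpops):      # ring: subpopulation i -> i+1
                emigrants = sorted(sp, key=fitness, reverse=True)[:k]
                host = subpops[(i + 1) % len(subpops)]
                host.sort(key=fitness)            # the worst hosts are replaced
                host[:k] = emigrants
    return max((x for sp in subpops for x in sp), key=fitness)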
Types of communications between the subpopulations:
- neighborhood migration (individuals move to adjacent subpopulations, only);
- ring migration (the subpopulations are organized in a ring; each one sends its emigrants
  to the next one);
- unrestricted migration (any subpopulation can exchange individuals with any other one).

The implementations lead to good results if the best individuals of each subpopulation
are encouraged to migrate.

 Within each subpopulation, the best adapted individuals have multiple
  instances.
 During migration, the worst individuals of the host subpopulation are replaced
  by well adapted solutions coming from other subpopulations; therefore, each
  subpopulation benefits from the experience of the other ones.

The emigrants can also be chosen from the offspring - more offspring are produced at the
generations which involve migration.
 Some offspring migrate to other subpopulations. Because they combine the
  genetic material of well adapted individuals, their genotype can be valuable for
  the host subpopulation.
GAs with migration lead to reduced execution times.
GAs with migration lead to better accuracy.

!!!!! Usually, a GA working on a single population having the size equal to the sum of the
subpopulations' sizes has worse results than the migration-based GA.

GAs with migration are suitable for multimodal optimization - each subpopulation can
evolve towards a distinct optimum point.

2.10.3. Diffused GAs (neighborhood-based, with fine granularity)

Unlike the island model, which establishes rigid boundaries between the isles, here the
population is treated as a whole.

Some constraints concerning the recombination of the individuals are
imposed: the mate can be a neighbor, only.
  The recombination is carried out as follows: each node receives copies of its
  neighbors and sends copies to them. One of the parents is the individual
  encoded in the node. The second parent is chosen from the received
  duplicates. A single offspring is produced and it competes with the individual
  of the node. One can also allow the implicit survival of the offspring, regardless of
  its fitness value.

The initial population is random, uniformly distributed over the exploration space.
After several generations, some clusters can be observed, indicating regions where the
nodes contain similar individuals.
  Better adapted individuals tend to be spread over the population, thus
  conquering a larger surface.

- this GA uses a local selection, in compliance with the natural model.

2.11. Benchmarks for GA evaluation

Used for empirical analysis.
!!!!!!! There is no objective function which permits the generalization of the analysis
concerning its optimization.
Usually, the benchmarks are less complex than engineering/industrial applications; they
contain:
  unimodal and multimodal objective functions;
  non-differentiable functions.

The functions should be scalable: the complexity of the optimization problem should be
tunable via some parameters.

A benchmark for constrained optimization can be found in [Michalewicz].
A good benchmark has been proposed by Bäck:

 quadratic (sphere) function:

  f_1(x) = \sum_{i=1}^{n} x_i^2;  x = [x_1 ..... x_n];

   Unimodal function (admitting a single optimum point); usually n = 2.

 stair function, resulted by the discretisation of the continuous quadratic function in terms of
  its output values.
   Discontinuities, multiple local optimum points.

 Ackley:

  f_3(x) = -c_1 \exp\left(-c_2 \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\right) - \exp\left(\frac{1}{n}\sum_{i=1}^{n} \cos(c_3 x_i)\right) + c_1 + e;  x = [x_1 ..... x_n];

  usually: c_1 = 20; c_2 = 0.2; c_3 = 2\pi; n < 30; x_i \in [-20; 30]
   Multimodal function.

 Fletcher & Powell:

  f_4(x) = \sum_{i=1}^{n} (A_i - B_i)^2;  x = [x_1 ..... x_n];

  A_i = \sum_{j=1}^{n} (a_{ij}\sin\alpha_j + b_{ij}\cos\alpha_j),  B_i = \sum_{j=1}^{n} (a_{ij}\sin x_j + b_{ij}\cos x_j),

  a_{ij}, b_{ij} \in (-100, 100); \alpha_j \in (-\pi, \pi); x_i \in (-\pi, \pi); n < 30
   Multimodal, non-symmetric function. Very complex optimization.

 fractal function:

  f_5(x) = \sum_{i=1}^{n} (C'(x_i) + x_i^2 - 1);  x = [x_1 ..... x_n];

  C'(x_i) = \begin{cases} \dfrac{C(x_i)}{C(1)\,|x_i|^{2-D}}, & \text{for } x_i \neq 0 \\ 1, & \text{for } x_i = 0 \end{cases},  C(x_i) = \sum_{j=-\infty}^{\infty} \frac{1 - \cos(b^j x_i)}{b^{(2-D)j}},

  D = 1.85; b = 1.5; n < 20; x_i \in (-5, 5)
   Non-differentiable function. Very complex optimization.
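For reference, hedged Python versions of two of these benchmarks (the quadratic/sphere
function and Ackley, with the usual constants) are sketched below:

import numpy as np

def sphere(x):
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))                    # f1: unimodal, minimum 0 at the origin

def ackley(x, c1=20.0, c2=0.2, c3=2.0 * np.pi):
    x = np.asarray(x, dtype=float)
    n = x.size
    term1 = -c1 * np.exp(-c2 * np.sqrt(np.sum(x ** 2) / n))
    term2 = -np.exp(np.sum(np.cos(c3 * x)) / n)
    return float(term1 + term2 + c1 + np.e)         # f3: multimodal, minimum 0 at the origin

print(sphere([0.0, 0.0]), ackley(np.zeros(10)))     # both evaluate to ~0 at the optimum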
CHAPTER 3. ARTIFICIAL NEURAL NETWORKS

3.1. Artificial neuron
= the basic computational unit of artificial neural networks (ANN)

Notations:
  p_1, ...., p_R - neuron inputs;
  w_1, w_2, ..... w_R - weights of the incoming links;
  b - bias;
  f: R -> R, y = f(n) - activation function, usually nonlinear.

Components:
- synapses or links, characterized by weights (also called strengths);
- summing block and activation function (the activation function is usually nonlinear).

Input-output mapping: static model

The output of the model is computed as follows:

  y = f(n),

where the input of the activation function, called activation (n), is:

  n = w_1 p_1 + ..... + w_R p_R + b \cdot 1 = [w_1 ... w_R][p_1 ... p_R]^T + b = Wp + b,

or

  n = \sum_{i=1}^{R} w_i p_i + b,

with b \in R, W \in R^{1 x R}, p \in R^{R x 1}.
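A direct transcription of these equations in Python (illustrative values only):

import numpy as np

def neuron(p, W, b, f=np.tanh):
    """y = f(Wp + b) for a single neuron; W is 1 x R, p is R x 1, b is a scalar."""
    n = float(W @ p + b)      # activation
    return f(n)

y = neuron(p=np.array([0.5, -1.0]), W=np.array([0.8, 0.3]), b=0.1)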
b can be treated as a supplementary weight:
  w_0 = b, for the input p_0 = 1:

  n = \sum_{i=0}^{R} w_i p_i = \tilde{W}\tilde{p},

notations:
  \tilde{W} = [w_0 \; w_1 \; ... \; w_R] - extended weight vector,
  \tilde{p}^T = [1 \; p_1 \; ... \; p_R] - (transposed) extended input vector.

So, the diagram can be redrawn with the supplementary input p_0 = 1 weighted by w_0 = b.

Usually: y \in [0, 1] or y \in [-1, 1].

Another alternative for the extended weight vector:

  \tilde{W} = [w_1 \; ... \; w_R \; b],  \tilde{p}^T = [p_1 \; ... \; p_R \; 1],

  n = \sum_{i=1}^{R+1} w_i p_i = \tilde{W}\tilde{p}.
Comparison between the artificial neuron (AN) and the biological one (BN):

1) AN admits negative weights !!! (unlike BN):
   positive weights - for excitatory effect,
   negative or null weights - for inhibitory effect.

2) time constants: BN ~ 1 msec, AN ~ 1 nsec
   => fewer links and fewer neurons are needed for an ANN.

3) energetic efficiency: BN ~ 10^{-16} J/sec per operation, AN ~ 10^{-6} J/sec per operation.
4) BNN works asynchronously, without a master clock (continuous time domain).
5) BNN involves random connectivity; ANN uses a specified connectivity.
6) BNN are tolerant to errors.

Typical activation functions:

Deterministic functions

1) hard limiter:

   y = f(n) = \begin{cases} 1, & n \geq 0 \\ 0, & n < 0 \end{cases}

   In terms of nn = \sum_{i=1}^{R} w_i p_i, the threshold sits at nn = -b.
   The symmetric hard limiter outputs +1 for n \geq 0 and -1 for n < 0.

2) linear: y = f(n) = n.

3) sigmoid:

   y = f(n) = \frac{1}{1 + \exp(-cn)}, c > 0.

   For c = 1, the output grows smoothly from 0 to 1, crossing 0.5 at n = 0.

   [Figure: sigmoid responses of a single-input neuron for the four sign combinations
   w = \pm 2, b = \pm 3; at the inflection point a = 0.5, p = -b/w, and the tangent to the
   graph has slope w/4.]

4) hyperbolic tangent:

   y = f(n) = \frac{1 - \exp(-2cn)}{1 + \exp(-2cn)}, c > 0.

   For c = 1, the output grows smoothly from -1 to +1, crossing 0 at n = 0.

5) Gaussian function:

   y = f(n) = e^{-(n-c)^2} >> see RBF (other input-output mapping).
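The deterministic activation functions above can be transcribed in Python as follows
(c = 1 by default; a sketch, for illustration):

import numpy as np

def hard_limiter(n):    return np.where(n >= 0, 1.0, 0.0)
def linear(n):          return n
def sigmoid(n, c=1.0):  return 1.0 / (1.0 + np.exp(-c * n))
def tanh_act(n, c=1.0): return (1.0 - np.exp(-2*c*n)) / (1.0 + np.exp(-2*c*n))
def gaussian(n, c=0.0): return np.exp(-(n - c) ** 2)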

3.2. ANN architectures

The ANN structure allows a parallel and distributed processing.

Each ANN can be represented by a directed graph.

The nodes of the graph correspond to neurons.

The nodes are connected by links which ensure unidirectional and instant
communication.

A processing unit (neuron) admits any number of input links.

A processing unit has local memory.

A processing unit can be modeled in terms of an input-output formalism.

The neurons are organized in layers.

Within a layer, the neurons are considered to work in parallel.

[Figure: a generic ANN with an input layer (inputs u_1, ..., u_m), hidden layers and an
output layer (outputs y_1, ..., y_k).]

Legend:
- lateral links (between the nodes of the same layer);
- feedback links (from the output of a neuron to its input);
- backward links (to the neurons of the previous layers);
- feedforward links (to the neurons of the next layers).

Remark:
- the input layer does not perform any processing; it is not counted;
- the hidden layers and the output layer include neurons.

Types of ANN:
 feed-forward: with feed-forward links, only;
 dynamic/recurrent: contain at least one lateral, backward or feedback link.

Example 1: Feed-forward architectures with one layer

o Only feed-forward links!!!
o Input layer: m inputs, no processing.
o Output layer: k neurons, characterized by

  y_l = f_l(w_{l,1}u_1 + ... + w_{l,m}u_m + b_l), l = 1..k.

[Figure: each output neuron l receives all the inputs u_1, ..., u_m through the weights
w_{l,1}, ..., w_{l,m}, adds the bias b_l and applies the activation function f_l.]

Let us assume that all activation functions are identical.

One can write, for l = 1..k:

  y_l = f(w_{l,1}u_1 + ..... + w_{l,m}u_m + b_l),

so

  y = f(\tilde{W}\tilde{u}),

with

  \tilde{W} = \begin{bmatrix} w_{1,1} & ... & w_{1,m} & b_1 \\ ... & ... & ... & ... \\ w_{l,1} & ... & w_{l,m} & b_l \\ ... & ... & ... & ... \\ w_{k,1} & ... & w_{k,m} & b_k \end{bmatrix} = extended weights matrix
  (row l collects the extended weights of neuron l),

  \tilde{u} = [u_1 ... u_m \; 1]^T = extended input vector,

  y = [y_1 ... y_k]^T = output vector.

Remark:
  \tilde{W} = [w_{i,j}], i = 1..k, j = 1..m+1;
  for w_{i,j}: - the first index indicates the neuron,
               - the second index indicates the link.

Simplified diagram for feedforward ANNs with 1 layer:

[Figure: the inputs u_1, ..., u_m feed Neuron 1, ..., Neuron k of the output layer, which
produce y_1, ..., y_k.]

Remark: The layers can be
 fully connected - all feedforward links are used,
 partially connected - some feedforward connections are missing.

Example 2: Feed-forward architectures with two layers

[Figure: the inputs u_1, ..., u_m feed the hidden layer (layer 1, s neurons); the hidden
outputs y^1_1, ..., y^1_s feed the output layer (layer 2, k neurons), which produces
y^2_1, ..., y^2_k. The weight w^1_{i,j} links the input j to the hidden neuron i; the weight
w^2_{i,j} links the hidden neuron j to the output neuron i; b^1_i, b^2_i are the biases.]

Identical activation functions within each layer!!!!

- Layer 1:

  y^1 = f^1(\tilde{W}^1\tilde{u}), with

  y^1 = [y^1_1 ... y^1_s]^T \in R^{s x 1},

  \tilde{W}^1 = \begin{bmatrix} w^1_{1,1} & .. & w^1_{1,m} & b^1_1 \\ .. & .. & .. & .. \\ w^1_{s,1} & .. & w^1_{s,m} & b^1_s \end{bmatrix} \in R^{s x (m+1)},

  \tilde{u} = [u_1 ... u_m \; 1]^T \in R^{(m+1) x 1}.

- Layer 2:

  y^2 = f^2(\tilde{W}^2\tilde{y}^1), with

  y^2 = [y^2_1 ... y^2_k]^T \in R^{k x 1},

  \tilde{W}^2 = \begin{bmatrix} w^2_{1,1} & .. & w^2_{1,s} & b^2_1 \\ .. & .. & .. & .. \\ w^2_{k,1} & .. & w^2_{k,s} & b^2_k \end{bmatrix} \in R^{k x (s+1)},

  \tilde{y}^1 = [y^1_1 ... y^1_s \; 1]^T \in R^{(s+1) x 1}.

Remark:
  Upper index: the layer.
  Lower indexes: the first = the neuron; the second = the link of the neuron.
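The forward pass of the two-layer network can be written compactly with the extended
matrices defined above (a sketch; W1 is s x (m+1), W2 is k x (s+1), the last column of
each matrix holding the biases):

import numpy as np

def forward(W1, W2, u, f1=np.tanh, f2=lambda v: v):
    u_ext  = np.append(u, 1.0)       # extended input vector
    y1     = f1(W1 @ u_ext)          # hidden layer (layer 1)
    y1_ext = np.append(y1, 1.0)      # extended hidden output
    return f2(W2 @ y1_ext)           # output layer (layer 2)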

[Simplified diagram: inputs u_1, ..., u_m -> hidden layer (neurons 1..s, layer 1, outputs
y^1_1, ..., y^1_s) -> output layer (neurons 1..k, layer 2, outputs y^2_1, ..., y^2_k).]

Example 3: Feed-forward architectures with two layers - simplified diagram

[Figure: the same two-layer structure, drawn in compact form; the q^{-1} blocks denote
unit delays, used only in recurrent variants.]

ANN architecture =
  number of inputs and number of outputs,
  number of layers,
  number of neurons within each layer,
  map of links,
  type of the activation functions.

ANN parameters =
  for sigmoid/linear/step neurons: weights, biases;
  for Gaussian neurons: centers, spreads.

3.3. Multi-layer Perceptron (MLP)

MLP architecture

[Figure: the standard two-layer MLP; the hidden layer (layer 1, s neurons, with activations
v^1_i and outputs y^1_i) is followed by the output layer (layer 2, k neurons, with activations
v^2_i and outputs y^2_i).]

Characteristics:

o The layers are linked in series: the outputs of the neurons belonging to a layer are inputs for
  the neurons of the next layer.

o Within a layer, the neurons work in parallel.

o All the neurons have sigmoidal-type activation functions (linear, sigmoid, tanh).
o The MLP can have any number of hidden layers.

Criteria for learning algorithms based on error correction

Let us consider k neurons within the output layer.

1. On-line
The training samples (u(i), d(i)), i = 1..N are presented in sequence, one sample per iteration (the
number of iterations = multiple of N).

  I(n) = 0.5 \sum_{i=1}^{k} e_i^2(n), with e_i(n) = d_i(n) - y_i(n) = the error of the i-th output neuron.

2. Batch
All the training samples (u(i), d(i)), i = 1..N are presented at a single iteration.

  I(n) = \frac{1}{2N} \sum_{j=1}^{N} \sum_{i=1}^{k} e_i^2(n, j), with e_i(n, j) = the error of the i-th output neuron for the j-th training
  sample presented at the n-th epoch.

Backpropagation learning algorithm

= the steepest descent method (gradient):

  w^l_{ij}(n+1) = w^l_{ij}(n) - \eta \frac{\partial I}{\partial w^l_{ij}}(n) = w^l_{ij}(n) + \Delta w^l_{ij}(n),

  \eta > 0 - influences the convergence speed.

Overview - steps carried out at each epoch:

 Evaluate the ANN output and the error: feedforward IN -> OUT.

 Adapt the parameters: backward OUT -> IN (backpropagation).

- For online learning (I(n) = 0.5 \sum_{i=1}^{k} e_i^2(n)):

  Parameter variation
  = learning rate (\eta) x local gradient (\delta) x input (corresponding to the link).

- For batch learning (I(n) = \frac{1}{2N} \sum_{j=1}^{N} \sum_{i=1}^{k} e_i^2(n, j)),
  where e_i(n, j) = the error of the i-th output neuron for the j-th training sample presented at the n-th
  epoch:

  \Delta w^l_{ik}(n) = \frac{1}{N} \sum_{j=1}^{N} \Delta w^l_{ik}(n, j) - the mean of the variations separately computed for each sample.

Backpropagation adaptation equations

For the sake of simplicity, online learning is considered: I(n) = 0.5 \sum_{i=1}^{k} e_i^2(n).

1. For the output layer (denoted l)

Let us consider k output neurons and s neurons in the preceding layer.

  e_i(n) = d_i(n) - y^l_i(n) - the error produced by the output neuron i;

  v^l_i(n) = \sum_{j=0}^{s} w^l_{ij} y^{l-1}_j, with w^l_{i0} = b^l_i and y^{l-1}_0 = 1.

For a certain sample:

  \frac{\partial I}{\partial w^l_{ij}}(n) = \frac{\partial I}{\partial e_i}(n) \frac{\partial e_i}{\partial y^l_i}(n) \frac{\partial y^l_i}{\partial v^l_i}(n) \frac{\partial v^l_i}{\partial w^l_{ij}}(n)

  \frac{\partial I}{\partial w^l_{ij}}(n) = e_i(n) \cdot (-1) \cdot f'_i(v^l_i(n)) \cdot y^{l-1}_j(n) = -\delta^l_i(n) y^{l-1}_j(n)

  \Delta w^l_{ij}(n) = \eta \, \delta^l_i(n) \, y^{l-1}_j(n),

with

  \delta^l_i = e_i(n) f'_i(v^l_i(n)) = -\frac{\partial I}{\partial v^l_i}(n) = local gradient.

Parameter variation
= learning rate (\eta) x local gradient (\delta) x input (corresponding to the link).

2. For the hidden layers

Problem: find the contribution of a hidden neuron to the total error.

The parameters are adapted starting from the output layer towards the input layer.

Considering the hidden layer l, the local gradients within the layers l+1, l+2, ... etc. must be
available from previous computations.

The output of the neuron i belonging to layer l is input for the neurons belonging to layer l+1.

For simplicity: the layer l+1 is considered the output layer.

Layer l+1: with k neurons

  y^{l+1}_z(n) = f_{l+1}(v^{l+1}_z(n)), z = 1..k,

  v^{l+1}_z(n) = \sum_{j=0}^{s} w^{l+1}_{z,j} y^l_j(n),

  s = the number of input connections of a neuron of layer l+1 (the number of neurons within the
  previous layer, l), w^{l+1}_{z,0} = b^{l+1}_z, y^l_0 = 1.

  \delta^{l+1}_z - known for z = 1..k.

Layer l: with s neurons

  y^l_i(n) = f_l(v^l_i(n)), i = 1..s,

  v^l_i(n) = \sum_{j=0}^{q} w^l_{i,j} y^{l-1}_j(n),

  q = the number of input connections of neuron i (the number of neurons belonging to the previous
  layer), w^l_{i,0} = b^l_i, y^{l-1}_0 = 1.

If l is the first hidden layer (l = 1), then y^{l-1}_i(n) = u_i(n).

  \frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} \frac{\partial I}{\partial e^{l+1}_z}(n) \frac{\partial e^{l+1}_z}{\partial w^l_{i,j}}(n)

  \frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} e^{l+1}_z(n) \frac{\partial e^{l+1}_z}{\partial y^{l+1}_z}(n) \frac{\partial y^{l+1}_z}{\partial v^{l+1}_z}(n) \frac{\partial v^{l+1}_z}{\partial y^l_i}(n) \frac{\partial y^l_i}{\partial v^l_i}(n) \frac{\partial v^l_i}{\partial w^l_{i,j}}(n)

  \frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} e^{l+1}_z(n) \cdot (-1) \cdot f'_{l+1}(v^{l+1}_z(n)) \cdot w^{l+1}_{z,i} \cdot f'_l(v^l_i(n)) \cdot y^{l-1}_j(n)

  \frac{\partial I}{\partial w^l_{i,j}}(n) = -y^{l-1}_j(n) \, f'_l(v^l_i(n)) \sum_{z=1}^{k} \delta^{l+1}_z w^{l+1}_{z,i}.

Therefore:

  \Delta w^l_{i,j}(n) = \eta \, y^{l-1}_j(n) \, \delta^l_i(n),

with

  \delta^l_i(n) = \left( \sum_{z=1}^{k} \delta^{l+1}_z w^{l+1}_{z,i} \right) f'_l(v^l_i(n)) = local gradient.

Parameter variation
= learning rate (\eta) x local gradient (\delta) x input (corresponding to the link).
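For one hidden sigmoid layer and a linear output layer, the delta equations above give
the following minimal online learning step (a sketch, with illustrative learning rate):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(W1, W2, u, d, eta=0.1):
    u_ext  = np.append(u, 1.0)
    v1     = W1 @ u_ext
    y1     = sigmoid(v1)
    y1_ext = np.append(y1, 1.0)
    y2     = W2 @ y1_ext                      # linear output layer
    e      = d - y2
    delta2 = e                                # f' = 1 for linear output neurons
    # hidden local gradients: (sum_z delta2_z * w2_{z,i}) * f'(v1_i)
    delta1 = (W2[:, :-1].T @ delta2) * y1 * (1.0 - y1)
    W2 += eta * np.outer(delta2, y1_ext)      # delta w = eta * delta * input
    W1 += eta * np.outer(delta1, u_ext)
    return W1, W2, float(0.5 * e @ e)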

Remarks

1) The derivatives of the activation functions

o Sigmoid:

  f(v) = \frac{1}{1 + \exp(-av)}, a > 0  =>  f'(v) = \frac{a \exp(-av)}{[1 + \exp(-av)]^2} = a f(v) [1 - f(v)]

o Hyperbolic tangent:

  f(v) = a \tanh(bv), a, b > 0  =>  f'(v) = \frac{b}{a} [a - f(v)] [a + f(v)]

2) Learning rate \eta > 0

For small values: low convergence speed; a quite smooth trajectory is followed within the
search space.

For large values: risk of unstable behavior.

Improvements:

2a) use inertial back-propagation (with momentum);

2b) use a distinct learning rate for each link.

2a) use inertial back-propagation (with momentum) - explanations

Generalized delta rule:

  \Delta w^l_{ij}(n) = \alpha \Delta w^l_{ij}(n-1) + \eta \delta^l_i(n) y^{l-1}_j(n), \alpha > 0 - momentum constant

  \Delta w^l_{ij}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t} \delta^l_i(t) y^{l-1}_j(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial I}{\partial w^l_{ij}}(t),

  \Delta w^l_{ij}(n) = -\eta \left[ \alpha^n \frac{\partial I}{\partial w^l_{ij}}(0) + \alpha^{n-1} \frac{\partial I}{\partial w^l_{ij}}(1) + ... + \frac{\partial I}{\partial w^l_{ij}}(n) \right].

\alpha \in [0, 1] has a stabilizing effect:

 when \frac{\partial I}{\partial w^l_{ij}}(t) keeps its sign at successive iterations, the absolute value of \Delta w^l_{ij} increases;

 when \frac{\partial I}{\partial w^l_{ij}}(t) changes its sign at successive iterations, the absolute value of \Delta w^l_{ij} decreases.

3) Online or batch learning?

- For online learning: the training samples must be randomly presented, to avoid cycling.

Online learning:

 Reduced memory consumption.

 Faster learning for large training data sets.

 Convergence hard to analyze (the samples must be randomly presented, for avoiding the
  stagnation in local optima).

 Good results for training data sets containing similar samples.

4) The initialization of weights

- The result is dependent on the initial ANN parameters.

o Use uniformly distributed or normally distributed random values (mean 0, spread chosen
  to avoid the saturation of the neurons).

5) Stop criteria
- only some recommendations can be made:

Recommendations:
o The norm of the gradient becomes close to 0.
  Disadvantage: numerous epochs can be involved.
o The variation of the criterion I becomes insignificant.
  Disadvantage: premature stop.

6) Efficient exploitation of the training samples

- For online learning, the successive samples should be different.
  When the training samples are randomly presented, this condition is frequently
  met.
- Outliers can impede the convergence and can lead to bad generalization capabilities.

7) The learning is faster for antisymmetric activation functions

  f(-v) = -f(v)

Ex:

 Symmetric hard limiter.
 Hyperbolic tangent: f(v) = a \tanh(bv),
  recommended values (LeCun): a = 1.7159, b = 2/3, with \frac{\partial f}{\partial v}(0) \approx 1.14.

8) Learning rate

The neurons should learn at the same speed.

- Usually, the gradients in the output layer are bigger, so \eta should be smaller for the output
  neurons.

- The neurons having more links can work with a smaller \eta.

  LeCun suggests: \eta = \frac{1}{\sqrt{m}}, m = the number of input links for a certain neuron.

ATTENTION!!! - Generalization capacity

 Training = approximation in terms of the training data set.
 Generalization = approximation in terms of another data set (test / validation).
 If N is too large or the ANN architecture is too complex, the model results overfitted.

 => select the simplest function possible,
    if there is no information to invalidate this choice.

Applications of MLP - Function approximation

MLP = universal approximator

Theorem:
Any continuous bounded function can be approximated with any desired degree of accuracy
\varepsilon > 0, by means of an MLP containing:

o a hidden layer with m neurons, characterized by continuous, bounded, monotonic
  activation functions;
o an output layer with a linear neuron (or a sigmoidal neuron working within its linear
  region):

  F(u) = \sum_{i=1}^{m} \alpha_i f\left( \sum_{j=1}^{R} w_{ij} u_j + b_i \right),

  m = the number of hidden neurons,
  R = the number of inputs.

Remarks regarding the content of this theorem:

- the MLP existence is guaranteed;

- the theorem does not give any indication concerning the resulted generalization capacity of the
  model or the time requested for learning;

- the optimal structure is not given.

Remarks regarding the applicability of this theorem:

- the value of m:
  o if m is small, the empiric risk is lower (reduced risk to learn the noise captured by the
    training samples);
  o if m is large, a good accuracy can be obtained;
- when a single hidden layer is used:
  o the parameters of the neurons tend to interact: the approximation of some samples can
    be improved solely by accepting a worse approximation for other samples.
  For ANNs with 2 hidden layers: the hidden layer 1 extracts the local properties;
                                 the hidden layer 2 extracts the global properties.

3.4. ANN with Radial Basis Functions - RBF

The neuron of RBFs
The structure of the hidden neuron of an RBF network:

[Figure: the inputs p_1, ..., p_R are compared with the centers c_1, ..., c_R; the distance
n = ||p - c|| is passed through the activation function y = f(n).]
>> see the MATLAB demo demorb1

  y = f(||p - c||) = f\left(\sqrt{(p_1 - c_1)^2 + ... + (p_R - c_R)^2}\right),

  p = [p_1 ... p_R]^T, c = [c_1 ... c_R]^T,
  c = center vector (a center for each input connection).

Usually, the Gaussian activation function is used:

  y = \exp\left(-\frac{||p - c||^2}{2\sigma^2}\right) = \exp\left(-\frac{(p_1 - c_1)^2 + ... + (p_R - c_R)^2}{2\sigma^2}\right),

  c = vector of centers for a hidden neuron,
  \sigma = spread.
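The Gaussian radial-basis neuron, transcribed in Python (illustrative):

import numpy as np

def rbf_neuron(p, c, sigma):
    """y = exp(-||p - c||^2 / (2 sigma^2))."""
    dist2 = np.sum((np.asarray(p, float) - np.asarray(c, float)) ** 2)
    return float(np.exp(-dist2 / (2.0 * sigma ** 2)))

print(rbf_neuron([1, 1], [1, 1], 0.5))   # 1.0: the input equals the center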

Remarks:

 The neuron is activated only if the input (vector) is similar to the center (vector).
  o The accepted similitude level is given by \sigma.
  o If \sigma is large, the neuron is activated even for a reduced similitude between the inputs and the centers.

 For inputs which are very dissimilar to the centers, the neuron is inactive:
  y \approx 0, for ||p - c|| >> 0, i.e. p, c very different.

Comparison between the Gaussian neuron and the perceptron:

 For the perceptron, p_j has a more significant influence on the activation of the neuron
  if the absolute value of p_j w_j is larger.

RBF architecture
The standard architecture includes:
 an output linear neuron;
 a single hidden layer with s Gaussian neurons.
- Because a single hidden layer is considered, the upper index will be deleted for most of
  the notations (it was only kept for making the distinction between the linear and the radial
  basis activation functions).

[Figure: the inputs u_1, ..., u_m feed s hidden radial-basis neurons (with centers
c_{i1}, ..., c_{im} and outputs y_i = f_1(n_i)); the output neuron computes
y = f_2(w_1 y_1 + ... + w_s y_s + b).]

  y = f_2(w_1 y_1 + .. + w_s y_s + b) = w_1 y_1 + .. + w_s y_s + b = \sum_{i=1}^{s} w_i y_i + b,

  y_i = f_1(||u - c_i||) = f_1\left(\sqrt{(u_1 - c_{i1})^2 + ... + (u_m - c_{im})^2}\right),

  u = [u_1 ... u_m]^T, c_i = [c_{i1} ... c_{im}]^T,
  c_i = center vector for the hidden neuron i.

For Gaussian activation functions within the hidden layer:

  y = b + \sum_{i=1}^{s} w_i \exp\left(-\frac{||u - c_i||^2}{2\sigma_i^2}\right) = b + \sum_{i=1}^{s} w_i \exp\left(-\frac{(u_1 - c_{i1})^2 + ... + (u_m - c_{im})^2}{2\sigma_i^2}\right),

  c_i = center vector for the hidden neuron i,
  \sigma_i = spread for the hidden neuron i.

RBF = universal approximator

RBF for classification problems

Cover's Theorem for pattern classification:
A complex classification problem (nonlinearly separable) has great chances to become linearly
separable via a nonlinear mapping to a space of high dimension.

Let us consider the samples u(i) = [u_1(i) .. u_m(i)]^T belonging to R^m
(e.g. the training samples).

Let us consider f: R^m -> R^s, s large, with f(u) = [f_1(u) ... f_s(u)]^T, f_1, ..., f_s: R^m -> R
(e.g. f_1, ..., f_s indicate the mappings provided by s hidden neurons).

Definition
The classes C_1, C_2 are f-separable, if there exists w = [w_1 .. w_s]^T \in R^s with:

  w^T f(u) > 0, for u \in C_1,
  w^T f(u) \leq 0, for u \in C_2.

Remarks:
- according to Cover's theorem: choose a large s and non-linear f_1, ..., f_s;
- the hyper-plane delimitating the classes is given by w^T f(u) = 0;
- the functions f_i could be radial basis ones.

Example: XOR problem

Classify the samples: u(1) = [1 1]^T \in C_1, u(2) = [0 1]^T \in C_2, u(3) = [1 0]^T \in C_2, u(4) = [0 0]^T \in C_1.

Let us consider:

  f_1(u) = \exp(-||u - [1\;1]^T||^2) = \exp(-(u_1 - 1)^2 - (u_2 - 1)^2),

  f_2(u) = \exp(-||u - [0\;0]^T||^2) = \exp(-u_1^2 - u_2^2).

Knowing the input samples, it results:

  f_1(u(1)) = 1,    f_2(u(1)) = 0.13
  f_1(u(2)) = 0.36, f_2(u(2)) = 0.36
  f_1(u(3)) = 0.36, f_2(u(3)) = 0.36
  f_1(u(4)) = 0.13, f_2(u(4)) = 1

[Figure: in the plane (f_1(u), f_2(u)), the mapped samples become linearly separable.]
RBF for function approximation (interpolation)

Find F: R^m -> R accepting (u(i), d(i)), i = 1..N,
with u(i) = [u_1(i) .. u_m(i)]^T \in R^m and d(i) \in R,
d(i) = F(u(i)) = the desired output of the function corresponding to the input u(i)
(these samples could be used for training).

Find the interpolation:

  F(u) = \sum_{i=1}^{N} w_i f_i(||u - u(i)||):

 the number of radial basis functions = the number of the training samples;
 the functions f_i accept the centers c_i = u(i).

The radial basis functions could be chosen as follows:

a) f_i(u) = \sqrt{||u - c_i||^2 + q_i^2}, q_i > 0: non-local, unbounded;

b) f_i(u) = \frac{1}{\sqrt{||u - c_i||^2 + q_i^2}}, q_i > 0: local, bounded;

c) f_i(u) = \exp\left(-\frac{||u - c_i||^2}{2\sigma_i^2}\right): local, bounded.
Knowing that d(i) = F(u(i)), it results:

  \begin{bmatrix} f_1(u(1)) & .. & f_N(u(1)) \\ .. & .. & .. \\ f_1(u(N)) & ... & f_N(u(N)) \end{bmatrix} \begin{bmatrix} w_1 \\ .. \\ w_N \end{bmatrix} = \begin{bmatrix} d(1) \\ .. \\ d(N) \end{bmatrix}.

Let us consider:

  \Phi = \begin{bmatrix} f_1(u(1)) & .. & f_N(u(1)) \\ .. & .. & .. \\ f_1(u(N)) & ... & f_N(u(N)) \end{bmatrix} = interpolation matrix.

Using this notation, the equation can be rewritten as follows:

  \Phi [w_1 .. w_N]^T = [d(1) .. d(N)]^T  =>  [w_1 .. w_N]^T = \Phi^{-1} [d(1) .. d(N)]^T - if \Phi is nonsingular.

Michelli's Theorem (1986)

If the functions f_i are radial and the samples u(i) \in R^m are distinct,
then \Phi is nonsingular.

Remarks:

o For f_i of types b) and c), \Phi is positive definite.

o For f_i of type a), \Phi admits N-1 positive eigenvalues and one negative eigenvalue.

Remarks:

o large N (many samples) => many radial basis functions => complex model (over-fitting);
o large N (many samples) => risk of a poorly conditioned interpolation matrix and large execution
  times.
o It is desirable to use fewer radial basis functions than training samples:
  s < N, s = the number of hidden neurons.

Instead of

  F(u) = \sum_{i=1}^{N} w_i f_i(||u - u(i)||),

one has to consider

  F(u) = b + \sum_{i=1}^{s} w_i f_i(||u - c_i||):

 The centers of the radial basis functions and the input training samples are different.

 The output neuron accepts a nonzero bias.

Knowing that d(i) = F(u(i)), i = 1..N, it results:

  \begin{bmatrix} f_1(u(1)) & .. & f_s(u(1)) & 1 \\ .. & .. & .. & .. \\ f_1(u(N)) & ... & f_s(u(N)) & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ .. \\ w_s \\ b \end{bmatrix} = \begin{bmatrix} d(1) \\ .. \\ d(N) \end{bmatrix}.

Let us denote:

  G = \begin{bmatrix} f_1(u(1)) & .. & f_s(u(1)) & 1 \\ .. & .. & .. & .. \\ f_1(u(N)) & ... & f_s(u(N)) & 1 \end{bmatrix} \in R^{N x (s+1)}.

Therefore, it results:

  G [w_1 .. w_s \; b]^T = [d(1) .. d(N)]^T  =>  [w_1 .. w_s \; b]^T = G^+ [d(1) .. d(N)]^T,

  G^+ = (G^T G)^{-1} G^T - the pseudo-inverse.
Example: Revisit the XOR classification problem

Classify the samples: u(1) = [1 1]^T \in C_1, u(2) = [0 1]^T \in C_2, u(3) = [1 0]^T \in C_2, u(4) = [0 0]^T \in C_1.

Let us consider:

  f_1(u) = \exp(-||u - [1\;1]^T||^2) = \exp(-(u_1 - 1)^2 - (u_2 - 1)^2),

  f_2(u) = \exp(-||u - [0\;0]^T||^2) = \exp(-u_1^2 - u_2^2).

For the above mentioned input training samples it results:

  f_1(u(1)) = 1,    f_2(u(1)) = 0.13
  f_1(u(2)) = 0.36, f_2(u(2)) = 0.36
  f_1(u(3)) = 0.36, f_2(u(3)) = 0.36          G = \begin{bmatrix} 1 & 0.13 & 1 \\ 0.36 & 0.36 & 1 \\ 0.36 & 0.36 & 1 \\ 0.13 & 1 & 1 \end{bmatrix}.
  f_1(u(4)) = 0.13, f_2(u(4)) = 1

Let us define: d(1) = 0, d(2) = 1, d(3) = 1, d(4) = 0.

Therefore, it results:

- G^+ (given by the MATLAB function pinv):

  G^+ = \begin{bmatrix} 1.7942 & -1.2195 & -1.2195 & 0.6448 \\ 0.6448 & -1.2195 & -1.2195 & 1.7942 \\ -0.8780 & 1.3780 & 1.3780 & -0.8780 \end{bmatrix}

  [w_1 \; w_2 \; b]^T = G^+ [0 \; 1 \; 1 \; 0]^T = [-2.439 \; -2.439 \; 2.7561]^T.
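The weights above can be reproduced with numpy's pseudo-inverse (a quick check of the
numerical result):

import numpy as np

G = np.array([[1.00, 0.13, 1.0],
              [0.36, 0.36, 1.0],
              [0.36, 0.36, 1.0],
              [0.13, 1.00, 1.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])
w1, w2, b = np.linalg.pinv(G) @ d
print(w1, w2, b)    # approximately -2.439, -2.439, 2.756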

Theorem:

Any continuous bounded function F: R^m -> R can be approximated with any desired degree of
accuracy by means of:

  F(u) = b + \sum_{i=1}^{s} w_i f\left(\frac{u - c_i}{\sigma}\right), \sigma > 0, if

f: R^m -> R is bounded and \int_{R^m} |f(u)| du < \infty.

 The requirements imposed by this theorem are met by the radial basis functions b), c).

 The radial basis function a) can be used for s = N, only.

 f is not necessarily symmetric !!!!!

For Gaussian functions:

  F(u) = b + \sum_{i=1}^{s} w_i \exp\left(-\frac{||u - c_i||^2}{2\sigma^2}\right), when the same spread is employed for all the hidden neurons.

Recommendation: choose s = \sqrt[3]{N}.

Remark:
the ANN with hidden radial basis activation functions and a linear output neuron is compliant
with the requirements of the previous theorem.

- training = optimization carried out in terms of the training data set:

  step 1: select the centers and the spread;

  step 2: assuming the centers and the spread are known, compute the output weights:

  [w_1 .. w_s \; b]^T = G^+ [d(1) .. d(N)]^T, with G = \begin{bmatrix} f_1(u(1)) & .. & f_s(u(1)) & 1 \\ .. & .. & .. & .. \\ f_1(u(N)) & ... & f_s(u(N)) & 1 \end{bmatrix} \in R^{N x (s+1)}.

Challenge: what centers to choose?

If the centers are known, the output weights can be computed in a single step.

- generalization = interpolation

Comparison between MLP and RBF:

  MLP                                         | RBF
  --------------------------------------------|----------------------------------------------
  Any number of hidden layers                 | One hidden layer
  Input operator = scalar product             | Input operator = Euclidean distance
  Nonlinearity in terms of all the neural     | Linearity in terms of the output parameters,
  parameters                                  | if fixed centers and spreads are assumed
  Large training time required                | Small training time if the centers are known
                                              | (useful for on-line training)
  Global action                               | Local action
  Fewer parameters for the same degree of     |
  accuracy (usually)                          |

Learning strategies

1. Random centers selection

Step 1. Choose the centers randomly (uniformly distributed over the input range).

Step 2. Compute the spread \sigma = \frac{d_{max}}{\sqrt{2s}},
with
  d_{max} = the maximum distance between the selected centers,
  s = the number of hidden neurons.

Step 3. Compute the weights and the bias.

2. Centers self-organization
Step 1. The centers are chosen via the clustering of the training input samples (e.g. K-means clustering).
K-means clustering (tip: learning via competition):
  Step 1-0: Choose random, distinct initial values for all the s centers, denoted c_i(n), with n = 0 and
            i = 1..s.
  Step 1-1: For the training sample u(n), compute ||u(n) - c_i||, i = 1..s and find the minimum
            distance, which indicates the nearest center for this sample. Consider i*, with i* \in 1..s, the
            nearest center.
  Step 1-2: Update the nearest center, moving it towards the sample:
            c_{i*}(n+1) = c_{i*}(n) + \eta [u(n) - c_{i*}(n)], with 1 > \eta > 0.
  Step 1-3: n <- n+1.
  Step 1-4: If some training samples have not been used yet, or the change made at step 1-2 is too
            large, go to step 1-1.

Drawback: the result depends on the initial values.

Step 2. Compute the spread \sigma = \frac{d_{max}}{\sqrt{2s}},
with
  d_{max} = the maximum distance between the selected centers,
  s = the number of hidden neurons.

Step 3. Compute the weights and the bias.
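A minimal numpy sketch of this strategy (competitive center updates, then the spread
formula, then the pseudo-inverse for the output weights); all rates are illustrative:

import numpy as np

def train_rbf(U, d, s, eta=0.1, sweeps=20, rng=np.random.default_rng(0)):
    C = U[rng.choice(len(U), s, replace=False)].astype(float)   # step 1-0
    for _ in range(sweeps):                                     # steps 1-1 .. 1-4
        for u in U:
            i = np.argmin(np.linalg.norm(u - C, axis=1))        # nearest center
            C[i] += eta * (u - C[i])                            # move it towards u
    d_max = max(np.linalg.norm(a - b) for a in C for b in C)    # step 2
    sigma = d_max / np.sqrt(2 * s)
    G = np.exp(-((U[:, None, :] - C[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    G = np.hstack([G, np.ones((len(U), 1))])                    # bias column
    w = np.linalg.pinv(G) @ d                                   # step 3
    return C, sigma, w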

3. Supervised centers selection

The parameters of the RBF are adapted by using error correction (an LMS-type algorithm).
Let us consider batch learning, with the criterion:

  I = \frac{1}{2} \sum_{j=1}^{N} e(j)^2 = \frac{1}{2} \sum_{j=1}^{N} \left[ d(j) - \sum_{i=1}^{s} w_i f\left(\frac{||u(j) - c_i||}{\sigma_i}\right) \right]^2:

 convex in terms of the weights;

 non-convex in terms of the centers (the centers optimization can lock in local optima).

For Gaussian activation functions:

  I = \frac{1}{2} \sum_{j=1}^{N} \left[ d(j) - \sum_{i=1}^{s} w_i \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right) \right]^2.

At each iteration, the parameters of the RBF (weights, centers, spreads) are updated according to the
following rules:

- for weights:

  w_i <- w_i - \eta_1 \frac{\partial I}{\partial w_i}, with

  \frac{\partial I}{\partial w_i} = -\sum_{j=1}^{N} e(j) \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right);

- for centers:

  c_i <- c_i - \eta_2 \frac{\partial I}{\partial c_i}, with

  \frac{\partial I}{\partial (c_i)_k} = -\frac{w_i}{\sigma_i^2} \sum_{j=1}^{N} e(j) \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right) [u(j)_k - (c_i)_k],

  with (c_i)_k, u(j)_k indicating the k-th component of the vectors c_i, i = 1..s and u(j), j = 1..N
  (having the length m);

- for spreads:

  \sigma_i <- \sigma_i - \eta_3 \frac{\partial I}{\partial \sigma_i}, with

  \frac{\partial I}{\partial \sigma_i} = -\frac{w_i}{\sigma_i^3} \sum_{j=1}^{N} e(j) \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right) ||u(j) - c_i||^2.
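One batch gradient step implementing the three rules above (Gaussian hidden neurons with
per-neuron spreads) can be sketched as follows; the learning rates are illustrative:

import numpy as np

def rbf_gradient_step(U, d, w, C, sig, etas=(0.01, 0.01, 0.01)):
    dist2 = ((U[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # N x s squared distances
    Phi   = np.exp(-dist2 / (2 * sig ** 2))                  # hidden outputs, N x s
    e     = d - Phi @ w                                      # output errors
    A     = Phi * e[:, None]                                 # e(j) * phi_i(u(j))
    w   += etas[0] * A.sum(axis=0)                           # weights rule
    C   += etas[1] * (w / sig ** 2)[:, None] * (A.T @ U - A.sum(0)[:, None] * C)
    sig += etas[2] * (w / sig ** 3) * (A * dist2).sum(axis=0)  # spreads rule
    return w, C, sig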

4. Constructive algorithm

- insert the hidden neurons sequentially: the center vector copies the input training sample that produces
the highest output squared error for the current architecture
>> see MATLAB

CHAPTER 4. NEURO-GENETIC SYSTEMS

= evolutionary artificial neural networks or neuro-genetic systems

ANN:
 robustness
 capacity of inductive learning (supervised or unsupervised)
 high computational capacity
 parallelism
+
GA:
 robustness, flexibility
 scarce a priori information required concerning the objective function

The symbiosis => higher adaptation capacity.

Classification in terms of the interaction provided between GA and ANN:

- supportive:
  reduced cooperation between GA and ANN.
  The methods are applied sequentially and separately, considering two distinct sub-problems, or are independently used for solving the same problem.

- collaborative:
  strong cooperation between GA and ANN.
  These combinations exploit more advantageously the merits of the involved
  techniques.
4.1. Supportive neuro-genetic systems

They involve weak cooperation between GA and ANN.
One technique assumes the leader role and the other one the secondary role,
or both techniques are used for solving the same problem.

A. GA and ANN used for solving the same problem

The solutions provided by GA and ANN are used in parallel - this redundancy can be useful for
diagnosis systems.

B. GA - primary role, ANN - secondary role

 The ANN helps in generating the initial population.
  The ANN delivers additional information concerning the feasible space (e.g. after a
  certain classification of feasible-unfeasible samples).
  50% of the initial population is generated randomly, 50% using the ANN.

C. ANN - primary role, GA - secondary role

C1. GA used for preparing the input data for the neural classifiers:

 feature (input) selection:
  Aim: improve the recognition rate and the execution times via the selection of a few
  relevant features.

  Assuming binary encoding, a locus can indicate the use/absence of a feature.
  The drawbacks result from the fact that the method involves a large computational time,
  as the evaluation of each chromosome demands training the corresponding classifier.

o Chang & Lippman obtained an 80% reduction of the features in a voice recognition problem.

o Guo & Uhrig designed a diagnosis system based on neural observers for a nuclear plant. The
  GA decided which are the inputs of each observer (given a large set of thousands of available
  variables).

  Aim: the ANN should have few inputs, and should be precise, so the objective function can be
  defined as follows:

  f(x) = \left( e^{(z-1)} \right)^{0.7\sqrt[3]{t+1}} \cdot \left( 1 - e^{-0.01\,err} \right)^{0.15(t+1)}, where

  x denotes the chromosome which has to be evaluated,

  z = \frac{\text{no. of variables}}{\text{no. of selected variables}},

  t \in (1, NR\_MAX\_GEN) denotes the number of the generation,

  err denotes the error of the ANN computed at the end of the training stage.

  Assuming binary encoding, 1/0 indicates the use/the absence of the corresponding
  plant variable.
  A similar problem was solved by Weller.

 input space transformation:
  The GA is used for selecting scaling and/or rotation parameters.
  These transformations are meant to ensure a better separation of the classes (smaller distances
  between the samples of the same class, larger distances between the samples belonging
  to distinct classes).

 training data set configuration:
  The training samples are chosen from a large database.
  A chromosome specifies the samples to be used and the sequence in which they have to
  be delivered during training.
  If too few measurements are accessible, Cho & Cha suggest the genetic production of
  virtual samples. In order to evaluate each resulted training data set, this set must be used
  for explicitly training the ANN.

C2. GA used for setting the parameters of the training rules.
  e.g.: the learning rate involved by the back-propagation algorithm, or the coefficients
  used in other adaptation rules (Chalmer, Bengio).

C3. GA used for analyzing the ANN behavior.
  Some explanations concerning the behavior of the ANN result by depicting the regions of
  the input space which correspond to the minimum, the maximum and the threshold
  values of the output neurons.
  To this end, a chromosome encodes an input vector and the objective function can be
  defined as:

  f(x) = ||y - y_d||^2, where

  x denotes a chromosome,
  y represents the neural output corresponding to x,
  y_d indicates the target output (minimum, maximum, threshold).

4.2. Collaborative neuro-genetic systems

o Strong cooperation between GA and ANN is considered.

o The symbiosis leads to a better adaptation capability.

o GA used for:
  training the ANN
  and/or
  selecting the ANN topology
  >> better accuracy and better generalization capabilities.

A. GA used for training

 Unlike gradient-based training, GA learning is robust and reduces the risk of stagnation
  in local optima.

 Genetic training can also be used for ANNs with non-differentiable activation functions or
  recurrent connections.

 Genetic training involves large computational times; however, a better convergence speed
  can be achieved via hybridization with local optimizations.

The neural topology is known:

 A chromosome encodes the whole set of parameters.

 One can use binary or float encoding.

 Usually, the objective function is the mean output error computed for the whole
  training data set.

Competing conventions
- The crossover produces offspring less adapted than their parents.

o If the genetic sub-chains corresponding to the hidden neurons are permutated, the
  functionality of the encoded ANN remains the same, yet the genotype is changed
  significantly (see the sketch below)
  (any hidden neuron is represented by a sub-chain encoding its parameters).
o If the parents encode similar ANNs in different genetic strings, the offspring can be
  significantly less adapted.
o If the offspring are implicitly inserted in the population of the next generation, then the
  convergence speed is dramatically altered.
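A tiny numpy demonstration of the first point: permuting the hidden neurons (the rows of
W1 together with the corresponding columns of W2) leaves the network function unchanged,
although the flattened genotype differs; all values are illustrative:

import numpy as np

W1 = np.array([[0.5, -0.2], [1.0, 0.3]])    # 2 hidden neurons (input + bias weights)
W2 = np.array([[0.7, -0.4]])                # linear output neuron
W1p, W2p = W1[[1, 0], :], W2[:, [1, 0]]     # swap the two hidden neurons

u = np.array([0.8, 1.0])                    # input extended with 1 for the bias
print(W2 @ np.tanh(W1 @ u), W2p @ np.tanh(W1p @ u))  # identical network outputs
print(np.array_equal(W1.ravel(), W1p.ravel()))        # False: different genotypes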

Ideas for reducing/avoiding the risk of competing conventions:

- use mutation only (no crossover, or a small p_c).
  Saravanan & Fogel use Gaussian mutation:
  each individual generates an offspring & tournament-based selection (k=10) & (N + N)
  insertion.

- use special crossovers (Hancock).
  Radcliffe uses similitude-based crossovers - similar blocks of the parents are sent
  unchanged to the offspring.
  The similitude can be evaluated by means of the Hamming distance (for binary encoding) or by
  using the rate of identical parameters. Before crossover, the neurons are re-sorted within the
  chromosome.
  Hancock analyzed empirically the performances of Radcliffe's crossover and of the multipoint
  crossover. The results indicated very good results for the multipoint crossover, too (even
  better results, if the selection pressure is high).

 Cascade correlation (CC) algorithm:

  o The algorithm starts with a simple structure including an input and an output layer.
  o The ANN is trained subject to the minimization of the output squared error.
  o If the accuracy is inappropriate, a new sigmoidal hidden neuron is introduced. Its
    input links are coming from all the neural inputs and from all the existing hidden
    neurons. Their weights are adapted via maximizing the covariance between the
    squared output error and the output of the new hidden neuron.
  o The output of the new hidden neuron becomes input for the output neuron(s). The
    weights of these new connections are computed via minimizing the output squared
    error.
  o As a single neuron is inserted at each stage, competing conventions cannot occur.

 Genetic version of CC:

  step 1: Initialize the minimal neural topology (2 layers: input and output).
  step 2: Adapt the weights, for N_ep epochs, by means of genetic learning.
  step 3: Test if the ANN accuracy is convenient, E < E_0 (E = output squared error):
          Yes - go to 8.
          No - continue with 4.
  step 4: Insert a new hidden neuron, denoted with N.
          C_1 = the set of N's input connections (coming from the neural inputs and the other hidden
          neurons),
          C_2 = the set of N's output connections.
          Initialize the weights of these links with random values close to 0.
  step 5: Adapt the weights of C_1 and the bias of N, by maximizing the covariance between
          the output of the new hidden neuron and the squared output error; the genetic
          algorithm is applied for N_ep_1 epochs (let C denote the best objective value).
  step 6: Adapt the weights of C_2, by minimizing the output squared error; the genetic
          procedure is applied for N_ep_2 epochs.
  step 7: Go to step 3.
  step 8: Stop.

Advantages:
- the parameters of a single neuron are trained at each stage;
- the algorithm constructs the neural topology too (without genetic techniques);
- the hybridization CC-GA allows the selection of simpler topologies, at the cost
  of an increased computational time.

Improvements of CC:
o Potter:
  The weights of C_2 are found by selecting the additive values belonging to (0, -C),
  which correct the best adapted individuals obtained at the precedent neuron
  insertion.

o Chen designs the RBF via the genetic selection of the spreads:

  f(x) = err + g^T g, where

  err denotes the output squared error,
  g is the vector of weights,
  [.]^T indicates transposition.

  Centers determined by means of Orthogonal Least Squares (OLS),
  Weights determined by means of Least Squares.

Other genetic training algorithms

 Hung & Adeli: two-stage training:
  - first stage: apply genetic training;
  - second stage: apply the conjugated gradient, using as initial point the solution delivered at the first stage.

 Topalov:
  - apply genetic training; whenever the GA stagnates, switch to back-propagation.

 Tsinas & Dachwald:
  - multiple sequences of training, each one consisting of genetic training followed by back-propagation; the maximum
    number of commutations (sequences) is preset.

 Ng:
  - apply back-propagation; if the output squared error is too big and its variation during the previous epochs is
    insignificant, then a GA is used to guide the search far away from the local optimal point.
    The GA aims at the minimization of the output squared error and uses Gaussian mutation.
    A large stagnation threshold permits too long stagnations, whilst a small one can generate false alarms.

 Ku:
  - train the recurrent neural networks by means of diffused GAs.
    The chromosomes are organized according to a matrix-based topology.
    Crossover acts between neighbors, only.

B. GA employed for neural topology selection

 The ANN structure influences the overall performances:
  o Too simple: low accuracy.
  o Too complex: longer training and evaluation; lower expected
    generalization capability.

 Two distinct standpoints can be considered:

  Improvement of the ANN learning abilities - shown by: accuracy, speed of training,
  generalization capacity.
  Main challenge: is a certain topology appropriate for learning a specific functionality?

  Better understanding of the neural representation - find ways for modeling symbolic
  knowledge.

 Most researches are focused on the first direction.

 GA can provide a more flexible selection of the neural topologies
  (it can consider any topology).
  GA can explore large and multimodal search spaces.
  No a priori information regarding the searching trajectory is required.

 Traditional algorithms (constructive or destructive)
  search within a limited area, only.

 To evaluate a neural architecture, a convenient set of parameters must be associated
  (e.g. by using a non-evolutionary training algorithm).
  The number of training epochs carried out for evaluating the individuals must be as
  low as possible.

 Expected noise:
  - the performances of the ANN will be influenced by the initial random values
    of the neural parameters;
  - the performances of the ANN will be influenced by the training algorithm;
    to eliminate this drawback, the GA can also work on the neural parameters.

Problem: find an appropriate encoding of the neural architecture.

The encoding must be compliant with:

 Correctness:
  allow a simple verification of the correctness of the encoded topology;
  the genetic operators must produce individuals encoding correct topologies.

 Sensitivity in terms of genetic operators:
  control the impact of the genetic operators (e.g. acting on the links, on specific inner
  structures, etc.).

The methods can be divided in direct and indirect encoding.

B1. Methods based on direct encoding of the neural topologies

 The strategy is appropriate for ANNs with few neurons and layers.
 The chromosomal encoding is devoted to a specific ANN.
 Most applications consider the MLP.
 Possible chromosomal encodings:
  matrix-based (e.g. the chromosomes encode the matrix of connections);
  vector-based (e.g. the chromosomes result by reshaping the matrix of connections);
  tree/graph-based (e.g. the chromosomes encode the tree of connections).
 Compliant genetic operators must be designed.

 Hybridization with non-evolutionary local optimizations:
  usually, in a Lamarckian manner (= compute/improve the
  neural parameters + store the new parameters).

Ideas for reducing/avoiding the risk of competing conventions:

- crossover rarely used or avoided.
  Maniezzo: recommends the use of crossover (for preserving the diversity inside
  the population).
  Angeline, Lee: use mutation only; mutation can act on the structure and on the
  parameters:
   structural mutation: introduce a neuron or a link, delete a neuron or a link.
    Immediately after insertion, the neurons are not connected with the rest of the
    ANN, the new connections being added by successive applications of mutation
    (small changes are allowed, only).

 Pujol & Poli:
  Use a dual chromosomal representation: matrix-based and vector-based.
  Produce offspring on both encodings >> good diversity.

Examples:
 Braun & Zagorski: consider a term describing the complexity of the encoded
  ANN within the objective function.
  Improved genetic operators: e.g. the neurons which are deleted are stored for
  further potential insertions.
 Dasgupta, Mann: hierarchical encoding - the higher levels include control genes,
  the leaves correspond to parametric genes.
  Changes performed within the upper levels correspond to significant alterations
  of the encoded neural architecture. Changes performed within the lower levels
  correspond to less significant alterations of the encoded neural architecture.
 Thierens: use a canonical representation which eliminates the effects produced
  by the symmetries of the activation functions, the permutations of neurons/links,
  etc. To this end, several transformations are made.
  Negative biases are changed to positive ones, and additionally the sign of all
  the corresponding incoming weights is also changed.
  Then, the neurons are sorted in terms of bias.

 Sato & Nagaya, Sato & Ochiai: use matrix-based encoding for evolving the
  neural architecture and the neural parameters, for ANNs with binary weights.
 Romaniuk: genetic CC for selecting the architecture of neural classifiers.
  o Recurrent ANNs with sigmoidal activation functions are considered.
  o The algorithm starts with a simple structure and adds new structures which
    are genetically configured. The blocks which were already inserted remain
    unchanged at the next steps.
  o Additionally, the resulted topologies are simplified by deleting the less
    important links.
  o The significance of a link is established in the following manner: sequentially,
    each connection is deleted and the response of the ANN is evaluated for all
    the input training samples, and the newly resulted faults are counted. Small
    counters indicate insignificant links.
 Liu & Yao: select the architecture and the parameters of generalized neural
  networks.
  o These ANNs include both sigmoidal and Gaussian neurons.
B2. Methods based on indirect encoding of the neural architectures

- may involve a more expensive evaluation,
- may lead to shorter chromosomes.
Recommended for ANNs with many neurons and layers, featuring structural
regularities.

Main categories of indirect encodings:

 parametric:
  Encode the parameters which describe the architecture, such as: the number of layers, the
  number of neurons within a layer, the type of accepted connections. Usually, this encoding
  refers to a limited number of possible topologies.

 developmental (grammar-based):
  The chromosome encodes a sequence of actions which allows the ANN generation, not the
  ANN itself. The actions are described via a predefined grammar.

 Gruau: cell-based encoding.

  o The algorithm was used for ANNs with binary and float weights. Good results were
    obtained for symmetric ANNs.

  o The neural architecture is built via a cellular division process. The algorithm starts with a root
    cell. Each cell possesses internal registers for storing the weights and the bias. The proposed
    language indicates the following actions: cellular division, the transformation of a cell into a
    neuron, choosing the values of the neural parameters, including delays, including recurrence.

  o A chromosome corresponds to the simplest architecture within a family of topologies.
    The members of the family can be obtained by recurrently adding the structural blocks.
    For the first member of the family, the recurrence is not activated. For the second member, the
    recurrence is activated only once, etc.

  o A chromosome is evaluated by considering the first k members of the family, for which the
    topology p+1 gives better results than the topology p, p = 1..k-1, and the topology k+1 is
    worse than the topology k. The objective function is equal to the sum of the output squared errors
    corresponding to the first k members.

  o The genetic operators act solely on the first member of the family. Both crossover and mutation
    can be used. Higher probabilities are assigned for changing a symbol to recurrence or vice versa.
    The individuals can be improved by Lamarckian local optimization.

Another classification of the genetic approaches devoted to neural architecture
selection
- how the chromosomes are used for generating the neural architecture:

A. Each chromosome encodes a single neural architecture.

 The result of the algorithm is usually considered the architecture of the best
  adapted individual found during the evolutionary loop or in the final population.

 Yao & Liu produce the delivered neural architecture by combining the genetic
  material of the individuals included in the last population.
  o Various types of combinations were suggested.
  o The resulted ANN has better performances, at higher computational costs.

B. The whole population forms a single neural architecture.

 In this case, the individuals are competitors, yet they must also cooperate.

 Smith & Cribbs: ANN with binary weights and hard limiter activation functions.
  - the population includes structural blocks which need to be aggregated in order to
    build the ANN (a chromosome encodes a structural block);
  - when the population contains multiple copies of an individual, a single copy is
    used within the neural structure;
  - the output weights are computed with Widrow-Hoff (non-genetically).

  [Figure: the blocks encoded by the chromosomes NNCrom1, NNCrom2, ..., NNCromN,
  improved with the GA, form the hidden part of the network; the output layer weights
  are set without the GA.]

 Fitness is computed as follows.
  For each training sample:
  - if the response of the ANN is correct, then all the n_1 chromosomes having the output 1 are awarded
    with the fitness 1/n_1; all the n_2 chromosomes having the output 0 are awarded with the fitness
    1/n_2;
  - if the response is incorrect, then all the n_3 chromosomes having the output 1 are awarded with the
    fitness -1/n_3; all the chromosomes having the output 0 cannot participate to the error correction,
    therefore their fitness will not be changed.

  A high fitness is assigned to a structural block which is useful for many samples or which is
  the main contributor for specific samples.

REFERENCES
Affenzeller, M., Winkler, S., Wagner, S., Beham, A. (2009). Genetic Algorithms and
Genetic Programming - Modern Concepts and Practical Application. Boca Raton, FL:
CRC Press, 157-207.
Angeline, P. J., Saunders, G. M., Pollack, J. B. (1994). An Evolutionary Algorithm that
Constructs Recurrent Neural Networks. IEEE Transactions on Neural Networks, 5 (1),
54-65.
Ashlock, D. (2006). Evolutionary Computation for Modeling and Optimization.
Springer, New York.
Baluja, S. (1996). Evolution of an Artificial Neural Network Based Autonomous Land
Vehicle Controller. IEEE Transactions on Systems, Man and Cybernetics - part B, 26
(3), 450-463.
Bäck, T., Fogel, D., Michalewicz, Z. (2000). Evolutionary Computation 2. Advanced
Algorithms and Operators. Institute of Physics Publishing, USA.
Barton, A. J., Valdés, J. J., Orchard, R. (2009). Neural networks with multiple general
neuron models: A hybrid computational intelligence approach using Genetic
Programming. Neural Networks, 22, 614-622.
Bengio, S., Bengio, Y., Cloutier, J. (1994). Use of Genetic Programming for the Search
of a New Learning Rule for Neural Networks. Proc. of the Conference on Evolutionary
Computation, USA, 324-327.
Benuskova, L., Kasabov, N. (2007). Computational Neurogenetic Modeling. Springer, New
Zealand.
Bonarini, A., Masulli, F., Pasi, G. (2003). Soft Computing Applications (Advances in Soft
Computing). Physica-Verlag, Heidelberg.
Braun, H., Zagorski, P. (1994). ENZO II - a Powerful Design Tool to Evolve Multi-layer
Feed Forward Networks. Proc. of the Conference on Evolutionary Computation, USA, 278-283.
Coello Coello, C. A., Lamont, G. B., Van Veldhuizen, D. A. (2007). Evolutionary
Algorithms for Solving Multiobjective Problems, 2nd Edition. New York, NY: Springer,
50-150.
Da Ruan (1997). Intelligent Hybrid Systems. Kluwer Academic Publishers, USA.
De Jong, K. A. (2006). Evolutionary Computation - A Unified Approach. Cambridge,
MA: MIT Press.
DiMattina, C. (2010). How to Modify a Neural Network Gradually Without Changing
Its Input-Output Functionality. Neural Computation, 22, 1-47.
Dumitrache, I., Buiu, C. (1995). Introduction to Genetic Algorithms. Ed. Politehnica,
Bucuresti, Romania.
Ferariu, L. (2005). Algoritmi evolutivi in identificarea si conducerea sistemelor.
Politehnium, Iasi, Romania.
Ferariu, L. (2010). Sisteme neurogenetice. Politehnium, Iasi, Romania.
Fleming, P. J., Purshouse, R. C. (2002). Evolutionary algorithms in control systems
engineering: a survey. Control Engineering Practice, 10, 1223-1241.
Fogel, D. (2006). Evolutionary Computation - Toward a New Philosophy of Machine
Intelligence, 3rd Ed. Piscataway, NJ: IEEE Press.
Gruau, F. (1993). Genetic Synthesis of Modular Neural Networks. Genetic Algorithms
Proc., 312-317.
Haykin, S. (2009). Neural Networks and Learning Machines, 3rd Edition. Prentice Hall,
USA.
Knowles, J., Corne, D., Deb, K. (Eds.) (2008). Multiobjective Problem Solving from
Nature - From Concepts to Applications. Pondicherry, India: Springer, 131-154.
Purshouse, R., Fleming, P. (2006). On the Evolutionary Optimization of Many
Conflicting Objectives. IEEE Transactions on Evolutionary Computation, 11 (6), 770-784.
Smith, R. E., Brown Cribbs III, H. (1997). Combined Biological Paradigms: A Neural,
Genetic-based Autonomous System Strategy. Robotics and Autonomous Systems, 22,
65-74.