EDITURA
CONSPRESS
2013
CONSPRESS
B-dul Lacul Tei nr.124, sector 2,
cod 020396, București
Tel.: (021) 242 2719 / 300; Fax: (021) 242 0781
CONTENT
CHAPTER 1. INTRODUCTION
Intelligence =
The capacity to improve one's own behavior based on acquired experience (gained by repeating the same or similar actions)
[Bäck, 2000].
Stages:
sensorimotor - focused on movement control and sensors
o the information acquired by the sensors is organized and processed;
o the first cognitive schemes are constructed; they refer mainly to one's own body/behavior and to neighboring objects;
preoperational
o symbolic thinking;
o capacity of generalization;
concrete operations
o deductive reasoning;
o higher interest in the surrounding environment;
formal operations
o use of abstract concepts,
o specification and verification of working assumptions, etc.
Inductive learning:
by examples (acquisition of concepts):
look for universal rules describing all positive and negative examples;
by observations and discovery (unsupervised):
look for universal rules describing the observations;
the observations are obtained without a supervisor.
Short history:
Beginning: 1950s.
Main research directions:
automatic proof of theorems, planning and prediction,
automatic programming, human language understanding
=> requirements for building Machine Learning.
Knowledge representation
by symbols (classic):
a formal set of primitives and rules is employed for symbol handling:
o predicates,
o frames, semantic networks,
o fuzzy systems;
by numbers (sub-symbolic):
o artificial neural networks,
o evolutionary algorithms.
Modern Genetics studies the way in which information is encoded by living organisms.
Evolutionary computation - translates the natural selection theory and evolution theory
to numerical algorithms.
Evolutionary programming
Goal: the design of finite state automatons able to predict the changes occurred in the working
environment.
The environment is described by a string of symbols (according to a finite encoding alphabet).
The algorithm searches the output symbol providing the fittest prediction.
Fogel, Burgin, Atmar.
Genetic programming
The algorithm searches the fittest program able to solve a certain problem.
Koza.
Evolutionary strategies
Meant to solve optimization problems with continuous parameters.
A structure encodes the values of the decision variables corresponding to a point of the search
space.
Unlike GA: other mechanisms are employed for enriching the genetic material throughout the
generations.
Rechenberg, Schwefel, Herdy, Kursawe, Ostermeier, Rudolph.
Classifiers
Devoted to the design of classifiers by means of evolutionary techniques.
Holland, Reitman, Booker, De Jong.
o GA use strategies borrowed from Genetics and from the natural selection and evolution theories.
Problem statement
Let us consider f : S ⊆ R^n → R.
The elements x ∈ S are called decision variables; f is the objective function, and f(x) is the objective value.
Find (the objective):
arg min_{x ∈ S} f(x) or arg max_{x ∈ S} f(x).
General description of GA
The individuals are evaluated in terms of the objective and the best ones are encouraged to
survive and reproduce.
New potential solutions (offspring) are obtained by combining the genetic material of the
parents, similarly to the recombination of DNA chains in biological systems. This process
guides the exploration by using the most valuable genetic material of the current population.
Small variations of the offspring ensure an adequate preservation of population diversity, with
positive impact on avoiding the stagnation in local optima.
The offspring fight for survival with the old solutions. The best adapted solutions will have
greater chances to win this contest.
initialization:
t=0;
generate N random points (uniformly) distributed within the search space, to form the initial
population P(t);
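The evolutionary loop sketched in the notes can be illustrated with a minimal self-contained Python sketch. The operator choices below (truncation selection of the better half, arithmetic crossover, Gaussian mutation, parents surviving each generation) are illustrative assumptions, not the single canonical GA; the sphere objective is chosen only because its optimum is known.

```python
import random

def sphere(x):
    # objective: f(x) = sum(x_i^2); global minimum f = 0 at x = 0
    return sum(xi * xi for xi in x)

def run_ga(f, n_vars=3, bounds=(-5.0, 5.0), pop_size=30,
           generations=100, pc=0.9, pm=0.1, seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    # initialization: N random points uniformly distributed in the search space
    pop = [[rng.uniform(lo, hi) for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)                      # evaluate; best first (minimization)
        parents = pop[:pop_size // 2]        # selection: keep the better half
        offspring = []
        while len(offspring) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            if rng.random() < pc:            # arithmetic crossover of two parents
                w = rng.random()
                child = [w * ai + (1 - w) * bi for ai, bi in zip(a, b)]
            else:
                child = list(a)
            for i in range(n_vars):          # mutation: small random variation
                if rng.random() < pm:
                    child[i] = min(hi, max(lo, child[i] + rng.gauss(0, 0.3)))
            offspring.append(child)
        pop = parents + offspring            # insertion: parents survive (elitist)
    return min(pop, key=f)

best = run_ga(sphere)
```

Because the parents of each generation survive, the best objective value never worsens, so the loop is elitist in the sense discussed later in these notes.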
Encoding
Most common: binary encoding.
The encoding ensures the mapping of the exploration space S to S*. The genetic operators will act in
S*.
The chain (string) used for encoding an individual is called chromosome.
A position (character) in this chain is called gene or locus.
The values allowed for a certain gene are called alleles (e.g., for binary encoding: 0 and 1).
The genotype indicates the structure of the chromosomes and the values of genes (it is related to S*)
The phenotype indicates the behavior of an individual obtained due to its specific genotype (it is related
to S).
Genetic operators
Crossover - works on two operands: 2 parents → 2 offspring, by interchanging some sub-chains.
[Figure: a) single cut point crossover - parents A and B exchange the sub-chains located after the cut point, producing offspring C1 and C2; b) multiple cut point crossover - the parents exchange several sub-chains delimited by the cut points.]
Mutation - changes the values of some randomly selected genes, e.g. (the second and the last bits are flipped):
1 0 .................... 0 1 0 ........... 1 1
1 1 .................... 0 1 0 ........... 1 0
Remarks:
Genetic operators act according to stochastic rules (their probabilities are smaller than 1):
not all the pairs of parents formed from the recombination pool are combined by means of crossover;
the cut points and the mutated genes are stochastically selected.
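The two binary operators above can be sketched in Python; the cut point and the mutated genes are drawn stochastically, as the remarks state. The 8-bit strings and probabilities are illustrative values, not prescribed by the notes.

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Interchange the sub-chains located after a randomly chosen cut point."""
    cut = rng.randrange(1, len(parent_a))   # cut point selected stochastically
    child1 = parent_a[:cut] + parent_b[cut:]
    child2 = parent_b[:cut] + parent_a[cut:]
    return child1, child2

def mutate(chrom, pm, rng):
    """Flip each gene (bit) independently with probability pm."""
    return ''.join(('1' if g == '0' else '0') if rng.random() < pm else g
                   for g in chrom)

rng = random.Random(1)
c1, c2 = single_point_crossover('11111111', '00000000', rng)
```

Note that for complementary parents the two offspring stay complementary gene by gene, since each position is copied from exactly one parent.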
Generally, an individual better than average is encouraged to survive and to produce offspring, because it contains genetic material better than the other solutions of the current population.
Potential downsides:
Even the worst solutions can generate well-fitted individuals by means of successive genetic changes (performed via the genetic operators).
To avoid stagnation in local optima and premature convergence, an adequate balance between convergence speed and diversity preservation is required. This balance is mainly tuned by means of parent selection and offspring creation/insertion.
Stop criteria
Because the algorithm works randomly and in an unsupervised manner, it is quite difficult to set a proper stop condition a priori.
The most common stop test checks a maximum number of generations.
The maximum number of generations is tuned by trial and error.
Another stop criterion verifies whether the differences between the individuals of the current population become smaller than a predefined threshold.
If the individuals are still different, the evolutionary loop continues. If the individuals become too similar, the evolutionary loop is broken.
The allowed difference is difficult to set a priori (more difficult than the number of generations).
The encoding is very important, as small genotypic differences can involve big phenotypic differences and vice versa.
1st order methods - use the 1st order derivatives of the objective function f
- assumption: the existence of the 1st order derivatives;
- example: steepest descent - the algorithm moves in the direction opposite to the gradient.
2nd order methods - use the 2nd order derivatives of the objective function
- assumption: the existence of the 2nd order derivatives;
- the search direction is opposite to the gradient; the 2nd order derivatives impose the search step at each iteration.
[Figure: the objective function and its derivative versus the decision variable x; the points x*, xA, xB, xC, xD, xE mark the global optimum and several local optima / stationary points.]
0 order methods (such as GA) use the objective values only → UNIVERSALITY = they can solve ANY optimization problem (including those with discontinuous, non-differentiable objective functions).
These methods are called weak/soft, because they request scarce a priori information about the targeted problem.
Available additional information can be integrated within GA in order to improve the
exploration capability and/or the convergence speed (e.g. start from a particular initial
population).
GA are efficient for complex (nonlinear, multiobjective, constrained, multimodal) optimizations:
They can converge toward the GLOBAL optima.
GA are EASY to implement and accept FLEXIBLE configuration.
GA are suitable for PARALLEL implementation.
GA involve unsupervised inductive learning based on the natural selection and evolution mechanisms.
[Figure: two ways to match a GA with a problem - a) modify the problem (via encoding) so that the standard genetic algorithm can be applied; b) change the GA techniques to cope with the original decision variables of the problem.]
[Figure: the encoding maps each point x = (x1, ..., xn) ∈ S to a chromosome v11...v1l ... vn1...vnl ∈ S*; the decoding performs the inverse mapping. The genetic operators act in S*, transforming P(t) into the next generation P(t+1) via the recombination pool G(t).]
[Figure: binary encoding of an interval [u, v] with l bits per decision variable; the successive codes correspond to the equally spaced points u, u+q, u+2q, u+3q, ..., v-q, where q is the quantization step.]
Example:
Let us consider the encoding of x1, x2 ∈ [-2, 2] by means of 4 bits per decision variable:
q = (v - u) / 2^l = (2 - (-2)) / 2^4 = 4/16 = 1/4.
o Code 0000 is associated with xi ∈ (-2, -2 + 1/4]; at the evaluation stage, the decoding leads to a single representative point of this interval.
o If the optimum point is x1 = x2 = -2 + 1/16, then the best result of the algorithm could be x1 = x2 = -2 + 2/16, with an error of 1/16 introduced by the finite length encoding. The error could be decreased by using longer chromosomal strings (which lead to smaller q).
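The example above can be checked numerically. The sketch below decodes an l-bit code to the left endpoint of its interval, u + k·q, with the integer k read from the bits in the standard most-significant-bit-first convention; the notes do not fix the bit order or the representative point, so both are assumptions here.

```python
def decode(code, u=-2.0, v=2.0):
    """Map an l-bit string to a point of [u, v] with step q = (v - u) / 2**l."""
    l = len(code)
    q = (v - u) / 2 ** l            # for 4 bits on [-2, 2]: q = 4/16 = 1/4
    return u + int(code, 2) * q     # code k -> u + k*q (one possible convention)

x_min  = decode('0000')             # leftmost representable point
step   = decode('0001') - decode('0000')
x_max  = decode('1111')             # rightmost representable point, v - q
```

Any point of [u, v] is therefore at most q away from a representable point, which is the finite-length encoding error discussed in the example.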
Disadvantages:
o the accuracy of the algorithm depends on the length of the chromosome → very long chromosomes are required to explore large, highly dimensional search spaces;
o the encoding can increase the complexity of the problem, e.g. the problem becomes multimodal (it admits multiple global optima);
This could happen whenever the ordering relationship for the distances in S is not preserved for the distances in S* (big distances between two individuals in S do not mean big distances for the same individuals in S*, and vice versa).
Solution: change the encoding, e.g. use Gray binary encoding.
Remark: For binary encoding, similarity (in S*) can be analyzed with the Hamming distance.
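Gray encoding is mentioned above as a remedy precisely because consecutive integers get codes at Hamming distance 1, so small phenotypic steps stay small in S*. A minimal sketch of the standard binary↔Gray conversion:

```python
def binary_to_gray(b):
    """Standard reflected Gray code: g = b XOR (b >> 1)."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """Inverse mapping: XOR-accumulate the shifted code."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def hamming(a, b):
    """Hamming distance between two codes (number of differing bits)."""
    return bin(a ^ b).count('1')
```

With plain binary encoding, 7 (0111) and 8 (1000) differ in all four bits; their Gray codes differ in a single bit, which is why a one-bit mutation can always move the phenotype by one quantization step.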
B. Modify GA techniques
Modified genetic operators!
- the decision variables are not encoded: their values are directly stored in the chromosomes (S* is not used);
- new genetic operators are needed to work in S (assuming an infinite encoding alphabet).
Advantages:
The chromosomal representation is more natural;
Additional knowledge can be more easily incorporated within the algorithm;
No need of extra computational time for decoding.
Remarks:
Which approach is best? There is no general answer:
B: intermediary results can be easily interpreted;
A: good theoretical background.
The encoding (A or B) must be joined with compatible genetic operators.
Different genetic operators have been suggested - there are no available rules for choosing the most suitable one.
Kursawe: the genetic operators must be designed taking into account the dimension of the search space.
Combine crossover and mutation.
Types of crossover:
single cut point crossover;
multiple cut point crossover:
- more efficient for exploration;
- random selection of the cut points + other methods (e.g. avoid interchanging identical sub-chains);
crossover using multiple parents: more than two parents participate in the production of an offspring;
discrete crossover: 50% probability to select each gene from one parent, 50% from the other parent.
Most popular ones: multiple cut point crossover and discrete crossover.
Mutation
- keeps the diversity of the offspring → avoids the stagnation of the algorithm.
pm must be correlated with the employed selection.
Usually:
GA: pm small (rare mutation).
Large pm can disturb the algorithm convergence.
Example: If all the offspring implicitly survive to the next generation, the use of pm > 1/l (l = the length of the chromosome) can lead to instability.
Recommendation - Bäck (1996), for binary encoding: use Gray encoding with pm = 1/l, with l = the length of the chromosome.
Decreasing pm:
large pm in the first generations, in order to refresh the genetic material;
small pm in the last generations, in order to allow the algorithm convergence.
Crossover
simple crossover: interchanges sub-chains randomly selected from the parents.
heuristic crossover: if parent x2 is better than parent x1,
x1' = a·(x2 - x1) + x2, with a ∈ (0,1) a random scalar.
intermediary crossover - parents x1 and x2 produce the offspring x1' and x2':
x1'_i = a_i·x1_i + (1 - a_i)·x2_i
x2'_i = (1 - a_i)·x1_i + a_i·x2_i
a = [a_i]_i - a vector of random values, having the same size as the chromosome; its elements can be chosen from (-0.25, 1.25).
x_{j,i}, j ∈ {1,2} indicates the i-th element of the chromosome x_j.
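The two formulas above can be sketched directly; the interval (-0.25, 1.25) for the a_i is taken from the notes. One useful property follows immediately from the formulas: for each gene, x1'_i + x2'_i = x1_i + x2_i, so the pair of offspring preserves the per-gene sum of the parents.

```python
import random

def intermediary_crossover(x1, x2, rng, lo=-0.25, hi=1.25):
    """x1'_i = a_i*x1_i + (1-a_i)*x2_i ; x2'_i = (1-a_i)*x1_i + a_i*x2_i."""
    a = [rng.uniform(lo, hi) for _ in x1]   # one random a_i per gene
    c1 = [ai * u + (1 - ai) * v for ai, u, v in zip(a, x1, x2)]
    c2 = [(1 - ai) * u + ai * v for ai, u, v in zip(a, x1, x2)]
    return c1, c2
```

With a_i ∈ (0, 1) the offspring genes stay between the parents; the slightly larger interval (-0.25, 1.25) lets them land a little outside that segment, as the figure below illustrates.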
[Figure: in the gene space (Gene 1 x Gene 2), the area where the offspring of parents x1 and x2 can be placed; with a_i ∈ (-0.25, 1.25), the offspring are placed on a segment slightly larger than the one delimited by the parents.]
Mutation
Uniform mutation - changes the values of some randomly selected genes.
The chromosome x = (v1, v2, ..., vk, ..., vn) is changed to x' = (v1, v2, ..., v'k, ..., vn) when the mutation acts on vk.
The new value is randomly chosen: v'k ∈ (vk - a, vk + a), a > 0.
This operator is very useful for populations containing multiple duplicates of the same individual.
Non-uniform mutation - the mutated gene is changed by a step
Δ(t, y) = y·(1 - z^((1 - t/T)^b)), with z ∈ (0,1) random; b ∈ N (usually b = 5),
T = maximum number of generations.
This mutation has a larger impact in the first generations, when the genes can be mutated in larger intervals (Δ(t, y) is bigger). During the last generations, only small variations are allowed.
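The step formula above (the standard non-uniform mutation of Michalewicz, reconstructed here from the surrounding definitions) can be sketched and checked for the claimed behavior: large steps early, vanishing steps as t approaches T.

```python
def delta(t, y, z, T=100, b=5):
    """Non-uniform mutation step: y * (1 - z**((1 - t/T)**b)).

    t: current generation, T: maximum number of generations,
    y: width of the allowed interval, z: random number in (0,1), b: shape.
    """
    return y * (1 - z ** ((1 - t / T) ** b))
```

For a fixed z = 0.5 and y = 1: at t = 0 the step is 1 - 0.5 = 0.5; at t = T the exponent becomes 0, z^0 = 1, and the step is exactly 0.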
Mutation
mutation with adaptive pm (Smith, Fogarty) - pm is encoded in the chromosome itself.
Crossover
Usually, the modified crossover operators are aimed at finding suitable cut points.
Selection is also used for insertion. However, insertion will be analyzed separately, as the specific mechanisms are different enough.
Attention should be paid to the fact that the GA works on a finite population, for a finite number of generations.
Selection is performed in two steps:
STEP I: fitness assignment - the objective values are mapped to fitness values (selection probabilities);
STEP II: sampling - the parents are sampled according to these probabilities, in order to fill the recombination pool.
There are two main alternatives for computing the fitness values:
By explicitly using the objective values;
By considering the rank assigned to the individual in a list sorted in terms of the objective values.
F(xi) = f(xi) / Σ_{i=1}^{N} f(xi), with Σ_{i=1}^{N} F(xi) = 1 and F(xi) = pi,
where
xi denotes the individual i of a population including N solutions,
f is the objective function.
Aims of scaling (e.g. linear scaling f*(x) = a·f(x) + b):
meet the requirement above;
change the influence of the individuals during the evolutionary loop:
F(xi) = f*(xi) / Σ_{i=1}^{N} f*(xi).
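Proportional (scaling-based) fitness assignment can be sketched in a few lines. The linear scaling f* = a·f + b is the one used in the example of this section; the concrete objective values below are illustrative.

```python
def selection_probs(objective_values, a=1.0, b=0.0):
    """p_i = f*(x_i) / sum_j f*(x_j), with linear scaling f* = a*f + b."""
    scaled = [a * f + b for f in objective_values]
    total = sum(scaled)
    return [s / total for s in scaled]

p_raw    = selection_probs([10, 20, 30, 40])          # no scaling
p_offset = selection_probs([10, 20, 30, 40], b=1000)  # large additive offset
```

A large offset b flattens the probabilities toward 1/N, which is exactly the slower-convergence effect discussed in the example that follows.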
Example:
If a = 1 and b is significantly bigger than the mean of f, then f* converges more slowly than f.
For N = 10 individuals with Σ_{i=1}^{10} f(xi) = 50, f(x1) = 25 and b = 100:
f*(x1) = f(x1) + b = 125,
p1 = f*(x1) / Σ_{i=1}^{10} f*(xi) = 125 / (50 + 1000) ≈ 0.12 - with scaling,
whereas p1 = 25/50 = 0.5 without scaling.
Remark:
For some authors, f* is used directly as fitness, as the proportional scaling is implicitly provided by the stochastic roulette-based sampling.
o Scaling-based fitness assignment gives very much credit to the individuals considerably better than average. If a population includes an individual which is significantly more adapted than the others, this individual will be selected with more copies within the recombination pool, and the offspring will be close to it, so this individual has huge chances to conquer the whole population with its duplicates. This impedes the exploration of larger areas and can cause the algorithm to remain locked in local optima.
This disadvantage is eliminated by ranking-based fitness assignment.
F(ri) = q - (ri - 1)·r,
where ri is the rank of the individual xi (rank 1 = the best), and q, r are parameters.
The selection probabilities belong to an arithmetic series (with step r).
To ensure that the sum of all the selection probabilities equals 1, it results:
q = r·(N - 1)/2 + 1/N, where N is the size of the population.
When r = 0 (and q = 1/N), all the individuals get the same fitness, no matter what performances they have.
When r = 2/(N·(N - 1)) (and q = 2/N), the biggest difference is made between the individuals placed on consecutive ranks. The worst chromosome is assigned the fitness value 0 and the best one the fitness value q = 2/N.
In terms of the selection pressure SP:
q = SP/N and r = 2(SP - 1)/(N·(N - 1)).
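Linear ranking can be sketched and its two stated properties verified: the fitness values sum to 1, and with SP = 2 the best individual gets 2/N while the worst gets 0.

```python
def linear_ranking(N, SP=2.0):
    """F(r_i) = q - (r_i - 1)*r, rank 1 = best; q = SP/N, r = 2(SP-1)/(N(N-1))."""
    q = SP / N
    r = 2 * (SP - 1) / (N * (N - 1))
    return [q - (rank - 1) * r for rank in range(1, N + 1)]

F = linear_ranking(10, SP=2.0)
```

Only the rank matters here, not the objective values themselves, which is the defining feature (and, as noted below, a drawback) of rank-based assignment.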
[Figure: linear ranking - the selection probability decreases linearly with the rank, from q (rank 1) down to q - (N - 1)·r (rank N).]
Remark:
For some authors, the fitness values are not equal to the probabilities of selection.
Nonlinear method:
F(ri) = q·(1 - q)^(ri - 1),
where ri denotes the rank of xi and q ∈ (0,1).
The selection probabilities belong to a geometric series of ratio (1 - q).
For any q ∈ (0,1), the requirement Σ_{i=1}^{N} F(xi) = 1 cannot be met exactly. However, this sum approaches 1:
Σ_{i=1}^{N} F(xi) = 1 - (1 - q)^N < 1; Σ_{i=1}^{N} F(xi) → 1.
Disadvantages:
it does not meet the requirements of the schema theory (used for analyzing the GA convergence);
it neglects the differences between the objective values of the individuals with consecutive ranks;
it requires setting a priori two parameters (q, r) or one parameter (SP) of huge impact.
o The spread indicates the range of the number of selections allowed for an individual: [min_no_samples, max_no_samples].
It measures the consistency of the method.
A small spread means that the real number of selected samples is close to the expected number of samples.
Please note that a finite number of selection trials is considered. For infinite trials, the number of occurrences would be compliant with the selection probabilities; however, for a small number of trials, huge differences can be met.
[Figure: roulette wheel - the circumference SUM is divided into sectors proportional to the fitness values; the cumulative boundaries are F(x1), F(x1) + F(x2), etc.]
Explanations:
Each individual gains a sector proportional to his fitness. The population forms a
circle with the length SUM (usually, SUM = 1 , although this requirement is not
mandatory).
Nsel random numbers are generated within (0, SUM), if Nsel indicates the number
of individuals needed in the recombination pool. Each selection corresponds to a
roulette turning; the individual sent to the recombination pool is the one indicated
by the position of the needle.
Obviously, a higher fitness value (assigned to a well adapted individual) leads to a larger sector and consequently to higher selection chances.
The method ensures:
null bias;
large spread: (0, Nsel) - any individual with non-null fitness can be selected;
computational complexity of order Nsel·ln(Nsel).
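The roulette turning described above can be sketched with cumulative sector boundaries and a binary search per needle position; the O(Nsel·ln Nsel) cost quoted in the notes corresponds to the log-time lookup per trial (the Nsel draws times log N search, under the usual assumption Nsel is of the order of N).

```python
import random
from bisect import bisect_left

def roulette(fitness, n_sel, rng):
    """Turn the roulette n_sel times; each needle position picks one individual."""
    cum, s = [], 0.0
    for f in fitness:
        s += f                       # sector boundaries F(x1), F(x1)+F(x2), ...
        cum.append(s)
    # each draw: a needle position in (0, SUM), located by binary search
    return [bisect_left(cum, rng.uniform(0, s)) for _ in range(n_sel)]
```

An individual with zero fitness owns a zero-length sector and is never picked, while any individual with non-null fitness can be, matching the spread (0, Nsel).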
Two individuals v and x compete; a random number uniformly generated in (0,1) is compared with e^((F(v) - F(x))/T):
if the random number is smaller, v wins the contest; otherwise x is the winner.
T is a float number, decreased during the evolutionary loop;
F(v) and F(x) are the fitness values.
Timeover
Let us consider a GA without genetic operators, which uses the selection for recombination in order to directly form the population of the next generation. The size of the population is kept constant.
The solutions better than the average of the initial population are progressively sampled with more and more copies.
After a finite number of generations, the best individual conquers the whole population with its duplicates and the algorithm stops.
This number of generations is called timeover.
Bäck (1996) compared the scaling-based selection with the rank-based selection:
Sampling is solved with a roulette-based method.
Scaling-based selections with linear, polynomial or exponential scaling lead to a timeover of order N·ln(N), where N is the size of the population.
The type of scaling has no huge influence.
Bäck recommends:
rank-based selection and k-tournament selection.
o The selection which ensures the survival of the best solution at every generation
is called elitist.
- for this type of selection, Rudolph proved the convergence towards the
optimal point.
A.1. Methods specific to GA
- In order to reduce memory and computational time consumption, the GA produces fewer offspring than the population size.
- Some offspring (the best ones) are inserted into the population.
size of the recombination pool / size of the population = generation gap
It indicates the informational gain of the algorithm per generation.
The new information is achieved by exploiting the most valuable genetic material inherited during the evolutionary loop.
- A constant number λ of offspring is inserted at each generation (λ = constant).
This means that at every generation the offspring are inserted in the population by eliminating parents.
Remark:
Schwefel recommends the (λ, N) insertion with λ >> N; however, the (N+λ) insertion has also proved its efficiency in numerous applications.
λ offspring are generated. For evolutionary strategies λ >> N, although the method can also be applied for λ ≤ N.
An intermediary population (of size N+λ) is formed by reuniting the old solutions and the offspring. Then, its best N individuals deterministically survive to the next generation.
o The insertion is elitist → the performances of the best solution are monotonically improved.
o The survivors can also be stochastically selected from the intermediary population, using k-tournament selection. Usually, each selected solution is extracted (eliminated from the intermediary population).
Small values of k are preferred.
For large k, the selection is close to the deterministic one.
o The method ensures high convergence speed. It can be used in combination with techniques able to preserve high diversity within the population.
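The deterministic (N+λ) insertion is a one-liner: reunite parents and offspring, keep the best N. The sketch below assumes minimization and an objective function f supplied by the caller.

```python
def plus_insertion(population, offspring, f, N):
    """(N+lambda) insertion: reunite the old solutions and the offspring,
    then let the best N individuals deterministically survive (minimization)."""
    return sorted(population + offspring, key=f)[:N]

survivors = plus_insertion([5.0, 3.0], [1.0, 4.0, 2.0], lambda x: x, N=2)
```

Since the current best is always in the reunited set, it always survives, which is exactly why this insertion is elitist.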
Life time should be assigned by taking into account the performances of each individual
relative to the performances of the individuals included in the current population and/or
in the previous populations.
Better individuals should live for longer time intervals, thus having higher
chances to produce offspring inheriting their genetic material.
where
m and M represent the minimum and maximum life times, respectively,
F(xi) denotes the fitness value of the individual xi,
F_min^abs and F_max^abs indicate the minimum and the maximum fitness values obtained from the beginning of the evolutionary loop.
Remark:
o It could be useful to use larger populations at the beginning of the algorithm (in order to encourage the exploration), and smaller populations during the last generations (when the exploration is merely guided around some solutions).
2.9. GA convergence
The theory of GA cannot entirely explain the involved mechanisms.
GAs are very good, but we do not know exactly why they are so good.
o The convergence has been proved for particular GAs, under unrealistic assumptions, such as infinite populations or an infinite number of generations.
o Each schema contains constant and variable genes. For the previous example, the schema is 0###.....##1. Here, # indicates the variable genes, for which any allele is permitted (in this case 0 or 1).
Therefore, the search can be viewed as the process which looks for the best adapted schemata.
Holland stated that the fitness value of an individual gives partial information about the adaptation capacity of the schemata belonging to the individual.
Rephrasing, the fitness of schema H can be computed as the mean fitness of the individuals containing instances of H.
The length, δ(H) = the length of the chain delimited by the first and the last constant genes, minus 1
- e.g., schema 01**10*1 has the length 8 - 1 = 7.
Schema Theorem: a GA with linear scaling-based selection, simple crossover and rare mutation encourages the multiplication of the schemata better adapted than average, having small lengths and small orders:
m(H, t+1) ≥ m(H, t) · [ f(H) / ((1/N)·Σ_{i=1}^{N} f(xi)) ] · [ 1 - pc·δ(H)/(l - 1) - o(H)·pm ],
with m(H, t) = the number of instances of schema H within the population at generation t.
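The schema quantities used in the theorem are easy to compute; the sketch below accepts both '#' and '*' as the variable-gene symbol, since the notes use both notations, and evaluates the crossover/mutation survival factor from the theorem.

```python
def order(H):
    """o(H): the number of constant (non-variable) genes."""
    return sum(1 for g in H if g not in '#*')

def defining_length(H):
    """delta(H): distance between the first and the last constant gene."""
    idx = [i for i, g in enumerate(H) if g not in '#*']
    return idx[-1] - idx[0]

def survival_factor(H, pc, pm):
    """Disruption term of the schema theorem: 1 - pc*delta(H)/(l-1) - o(H)*pm."""
    l = len(H)
    return 1 - pc * defining_length(H) / (l - 1) - order(H) * pm
```

For the example schema 01**10*1: order 5, defining length 7; with pc = 1 and pm = 0 the survival factor is 0, i.e. a maximally long schema is almost always cut by single-point crossover.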
Proof
o Let us consider the scaling-based selection applied on P(t), in order to fill the recombination pool with N samples.
o The expected number of selected samples for the individual xi having the fitness f(xi) is:
ni = f(xi) / ((1/N)·Σ_{i=1}^{N} f(xi)).
o After selection, within the recombination pool, the number of H's instances is:
m(H, t+1)_s = m(H, t) · f(H) / ((1/N)·Σ_{i=1}^{N} f(xi)), with
f(H) - the fitness of H computed for P(t),
(1/N)·Σ_{i=1}^{N} f(xi) - the mean fitness of P(t).
This equation indicates that the GA encourages the multiplication of the schemata which are better adapted than average.
o Holland proved that the number of schemata which are efficiently processed by the GA at a certain generation is about N^3, where N denotes the size of the population.
Bertoni and Dorigo argued that Holland's estimator is valid only for populations having the size proportional to 2^l.
o However, the GA can implicitly analyze significantly more schemata than the number of its individuals; this behavior is called implicit parallelism.
o Using Banach's fixed point theorem, a useful result concerning the GA convergence has been obtained:
A GA which is capable to improve the mean performances of its population at any successive generations converges towards a fixed population (fixed point).
Therefore, for any initial population, after an infinite number of generations, a final fixed population is obtained; this population includes optimal solutions only.
Remarks:
o The theorem does not give any result concerning the convergence speed of the algorithm.
Obviously, the convergence speed is influenced by the algorithm parameters (the size of the population, the genetic operators probabilities, etc.) and by the content of the initial population.
In real implementations, the number of generations must be finite, too.
Rudolph proved that a GA with elitist selection (which keeps the best solution within the population) converges towards the global optimum.
The requirement does not refer to the improvement of the mean performances of the population, but to the survival of the best adapted individual only.
Insightful explanations concerning the influence of selection and genetic operators were delivered by Qi and Palmieri. Let us consider a GA working on infinite populations for optimizing bounded, positive, unimodal objective functions with a finite number of discontinuities.
o If the initial population covers (continuously) the whole exploration space, the scaling-based selection will encourage the clustering of the individuals towards the regions characterized by the highest fitness values. The density of the solutions is increased around the optimum point.
o So, the use of selection without genetic operators guarantees the convergence towards the global optimum.
o This convergence is also proved for GAs working on infinite populations (which continuously cover the search space) with scaling-based selection and mutation of low magnitude or low rate.
When working with finite populations, the initial population does not include all the potential solutions of the exploration space, so the action of the genetic operators is crucial for refreshing the genetic material.
Also note that GAs involve a finite number of generations, so the convergence speed is vital for the algorithm performances.
This convergence speed depends on all the algorithm parameters.
2.10. Parallel GA
o Because GAs are time consuming, they are usually employed for offline applications. The execution time depends on the size of the population, the selection pressure, etc.
o Using smaller populations can lead to smaller execution times, at the cost of reduced accuracy.
o A more valuable approach for reducing the execution time without altering the other algorithm performances is to consider parallel implementations.
Three main directions can be distinguished: global GAs, migration-based GAs and diffusion-based GAs.
[Figure: global (master-slave) parallel GA - the MASTER executes the evolutionary loop, while SLAVE 1, ..., SLAVE k evaluate the individuals in parallel.]
initialization:
t=0; choose N random individuals to form P(t);
repeat while t < No_Generations
for each subpopulation SbP(t), execute separately:
step 1: evaluate SbP(t);
step 2: selection - fill the recombination pool of the subpopulation;
step 3: crossover - generate offspring using the parents selected at step 2;
step 4: mutation - apply small variations on the individuals obtained at step 3;
step 5: evaluate the offspring resulting at step 4;
step 6: insertion - create SbP(t+1), choosing N individuals from SbP(t) and from the offspring obtained at step 5;
if migration is allowed:
step 1: choose r% individuals (the best ones) from each subpopulation - for migration;
step 2: establish the content of the subpopulations for the next generation, eliminating the less adapted host individuals;
t=t+1;
end of the loop
display the best individual of the entire population;
end of the algorithm
[Figure: migration topologies - a) ring migration: each subpopulation sends emigrants only to its neighbor in the ring; b) unrestricted migration: any subpopulation can exchange individuals with any other one.]
The implementations lead to good results if the best individuals of each subpopulation are encouraged to migrate.
The emigrants can also be chosen from the offspring → more offspring are produced at the generations which involve migration.
Some offspring migrate to other subpopulations. Because they combine the genetic material of well adapted individuals, their genotype can be valuable for the host subpopulation.
!!!!! Usually, a GA working on a single population whose size equals the sum of the subpopulation sizes obtains worse results than the migration-based GA.
The initial population is random, uniformly distributed over the exploration space.
After several generations, some clusters can be observed, indicating regions where the nodes contain similar individuals.
Better adapted individuals tend to spread over the population, thus conquering a larger surface.
The test functions should be scalable: the complexity of the optimization problem should be tunable via some parameters.
f1(x) = Σ_{i=1}^{n} xi^2; x = [x1 ..... xn] (sphere function);

f(x) = -c1·exp(-c2·sqrt((1/n)·Σ_{i=1}^{n} xi^2)) - exp((1/n)·Σ_{i=1}^{n} cos(c3·xi)) + c1 + e; x = [x1 ..... xn] (Ackley's function);

f4(x) = Σ_{i} (Ai - Bi)^2; x = [x1 ..... xn], with Ai and Bi defined by sums over j;

fractal function, based on:
C(xi) = C'(xi) / (C'(1)·|xi|^(2-D)), for xi ≠ 0; C(xi) = 1, for xi = 0,
with C'(z) = Σ_{j=-∞}^{∞} (1 - cos(b^j·z)) / b^((2-D)·j).
[Figure: artificial neuron - inputs p1, ..., pR, weights w1, ..., wR, bias b, summing block producing n, activation function f, output y = f(n).]
Components:
- synapses or links, characterized by weights (also called strengths);
- summing block and activation function (the activation function is usually nonlinear):
n = Σ_{i=1}^{R} wi·pi + b,
f : R → R, y = f(n) - activation function, usually nonlinear,
with b ∈ R, W ∈ R^(1 x R), p ∈ R^(R x 1).
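The neuron model above can be sketched directly; the sigmoid default activation is only an illustrative choice (the notes list several activation functions below).

```python
import math

def neuron(p, w, b, f=lambda n: 1.0 / (1.0 + math.exp(-n))):
    """y = f(n), with n = sum_i w_i * p_i + b (summing block + activation)."""
    n = sum(wi * pi for wi, pi in zip(w, p)) + b
    return f(n)

y_linear  = neuron([1.0, 2.0], [0.5, 0.5], b=-1.5, f=lambda n: n)   # n = 0
y_sigmoid = neuron([0.0], [1.0], b=0.0)                              # f(0) = 0.5
```

The same function covers the extended-vector form that follows, since appending the bias to the weights and a constant 1 to the inputs gives the same sum n.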
The bias can be included in an extended weight vector:
n = Σ_{i=0}^{R} wi·pi = W~·p~, with the notations:
W~ = [w0 w1 ... wR] (transposed) - extended weight vector, with w0 = b,
p~^T = [1 p1 ... pR] - extended input vector.
[Figure: the same neuron, with the bias b treated as the weight w0 of a constant input p0 = 1.]
Alternatively, the bias can be appended at the end:
p~^T = [p1 ... pR 1], W~ = [w1 ... wR b], and n = Σ_{i=1}^{R+1} wi·pi = W~·p~.
Comparison between the artificial neuron (AN) and the biological one (BN):
3) energetic efficiency: BN ≈ 10^-16 J per operation, AN ≈ 10^-6 J per operation;
4) the BNN works asynchronously, without a master clock (continuous time domain);
5) the BNN involves random connectivity; the ANN uses specified connectivity;
6) BNNs are tolerant to errors.
Common activation functions (with nn = Σ_{i=1}^{R} wi·pi):
1) y = f(n) = { 1, n ≥ 0 ; 0, n < 0 } - hard limiter
[Figure: hard limiter - the output steps from 0 to 1 at nn = -b.]
2) y = f(n) = n - linear
[Figure: linear activation - the output grows linearly with nn.]
3) y = f(n) = 1/(1 + exp(-c·n)), c > 0 - sigmoid
[Figure: for c = 1, the sigmoid output crosses 0.5 at nn = -b.]
[Figure: sigmoid neuron output versus the input, for w = -2 < 0 combined with b = 3 > 0 and b = -3 < 0 (and the symmetric cases); at the inflection point: a = 0.5, p = -b/w, and the tangent to the graph has the slope w/4.]
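The activation functions listed in this section can be sketched and cross-checked against each other (e.g. the hyperbolic-tangent formula with c = 1 must coincide with the library tanh).

```python
import math

def hard_limiter(n):
    """1) hard limiter: 1 for n >= 0, 0 otherwise."""
    return 1.0 if n >= 0 else 0.0

def sigmoid(n, c=1.0):
    """3) sigmoid: 1 / (1 + exp(-c*n))."""
    return 1.0 / (1.0 + math.exp(-c * n))

def tanh_act(n, c=1.0):
    """4) hyperbolic tangent: (1 - exp(-2cn)) / (1 + exp(-2cn)) = tanh(c*n)."""
    return (1.0 - math.exp(-2 * c * n)) / (1.0 + math.exp(-2 * c * n))
```

At n = 0 the sigmoid gives 0.5, which is the inflection-point value a = 0.5 quoted in the figure above.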
4) y = f(n) = (1 - exp(-2c·n)) / (1 + exp(-2c·n)), c > 0 - hyperbolic tangent
[Figure: for c = 1, the output goes from -1 to +1, crossing 0 at nn = -b.]
5) y = f(n) = e^(-(n - c)^2) - Gaussian
The nodes are connected by links which ensure unidirectional and instant
communication.
[Figure: general ANN structure - an input layer (ANN inputs u1, ..., um), hidden layers, and an output layer (ANN outputs y1, ..., yk). Legend: lateral links (between the nodes of the same layer) and feedback links (from the output of a neuron to its input).]
Remark:
- the input layer does not perform any processing; it is not counted;
- the hidden layers and the output layer include neurons.
Types of ANN
[Figure: single-layer ANN - input layer with m inputs u1, ..., um; output layer with k neurons; neuron i has the weights w_{i,1}, ..., w_{i,m}, the bias b_i and the activation f_i(n), producing the output y_i.]
y_l = f(w_{l,1}·u1 + ..... + w_{l,m}·um + b_l), for each neuron l = 1, ..., k;
in compact form: y = f(W~·u~),
with
W~ = [ w_{1,1} ... w_{1,m} b_1 ; ... ; w_{l,1} ... w_{l,m} b_l ; ... ; w_{k,1} ... w_{k,m} b_k ] - extended weight matrix (row l holds the extended weights of neuron l),
u~ = [u1 ... um 1]^T - extended input vector,
y = [y1 ... yk]^T - output vector.
Remark:
W~ = [w_{i,j}], i = 1, ..., k, j = 1, ..., m+1;
for w_{i,j}:
- the first index indicates the neuron;
- the second index indicates the link.
[Figure: compact representation of the single-layer ANN - input layer (u1, ..., um) feeding Neuron 1, ..., Neuron k of the output layer.]
[Figure: two-layer ANN - input layer (u1, ..., um); layer 1 (hidden) with s neurons, weights w^1_{i,j}, biases b^1_i, activations f^1_i(n), outputs y^1_1, ..., y^1_s; layer 2 (output) with k neurons, weights w^2_{i,j}, biases b^2_i, activations f^2_i(n), outputs y^2_1, ..., y^2_k.]
- Layer 1:
y^1 = f^1(W~^1 u~), with
y^1 = [y^1_1; ...; y^1_s] ∈ R^{s×1},
W~^1 = [w^1_{1,1} ... w^1_{1,m} b^1_1; ...; w^1_{s,1} ... w^1_{s,m} b^1_s] ∈ R^{s×(m+1)},
u~ = [u_1; ...; u_m; 1] ∈ R^{(m+1)×1}.
- Layer 2:
y^2 = f^2(W~^2 y~^1), with
y^2 = [y^2_1; ...; y^2_k] ∈ R^{k×1},
W~^2 = [w^2_{1,1} ... w^2_{1,s} b^2_1; ...; w^2_{k,1} ... w^2_{k,s} b^2_k] ∈ R^{k×(s+1)},
y~^1 = [y^1_1; ...; y^1_s; 1] ∈ R^{(s+1)×1}.
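The two layer equations compose as sketched below; the sizes (m = 2, s = 3, k = 1) and the random weights are illustrative:

```python
import numpy as np

# Two-layer forward pass: y1 = f1(W1~ u~), then y2 = f2(W2~ y1~).
def layer(W_ext, x, f):
    return f(W_ext @ np.append(x, 1.0))   # append 1 for the bias column

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2 + 1))       # s x (m+1)
W2 = rng.normal(size=(1, 3 + 1))       # k x (s+1)

u = np.array([0.5, -1.0])
y1 = layer(W1, u, np.tanh)             # hidden layer (tanh neurons)
y2 = layer(W2, y1, lambda n: n)        # linear output layer
print(y2)
```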
Remark:
[Figure: the two-layer feedforward ANN redrawn compactly: input layer, hidden layer with s neurons, output layer with k neurons.]
[Figure: a recurrent ANN, in which delay blocks q^{-1} feed neuron outputs back as inputs.]
ANN architecture = the type of the activation functions and the structure of the connections.
ANN parameters:
- for sigmoid/linear/step activation functions: weights and biases;
- for Gaussian activation functions: centers and spreads.
[Figure: the multilayer perceptron (MLP): input layer; layer 1 with the neurons 1, ..., s and outputs y^1_1, ..., y^1_s; layer 2 with the neurons 1, ..., k and outputs y^2_1, ..., y^2_k.]
Characteristics:
o The layers are linked in series: the outputs of the neurons belonging to a layer are inputs for the neurons of the next layer.
o All the neurons have differentiable activation functions (linear, sigmoid, tanh).
o The MLP can have any number of hidden layers.
1. On-line
The training samples (u(i), d(i)), i = 1, N, are presented sequentially, one sample per iteration (the number of iterations is a multiple of N).
I(n) = 0.5 Σ_{i=1}^{k} e_i²(n), with e_i(n) = d_i(n) - y_i(n) = the error of the i-th output neuron.
2. Batch
All the training samples (u(i), d(i)), i = 1, N, are presented at each iteration (epoch).
I(n) = (1/(2N)) Σ_{j=1}^{N} Σ_{i=1}^{k} e_i²(n, j), with e_i(n, j) = the error of the i-th output neuron for the j-th training sample.
The parameters are adapted by gradient descent on I: Δw^l_{ij} = -η ∂I/∂w^l_{ij}.
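The two criteria can be sketched as below, for a network with k outputs (the values in D and Y are illustrative):

```python
import numpy as np

# On-line vs. batch error criteria for a k-output network.
def online_criterion(d, y):            # one sample: I(n) = 0.5 * sum_i e_i^2
    e = d - y
    return 0.5 * np.sum(e ** 2)

def batch_criterion(D, Y):             # N samples: I(n) = sum of e^2 / (2N)
    E = D - Y
    return np.sum(E ** 2) / (2 * D.shape[0])

D = np.array([[1.0, 0.0], [0.0, 1.0]])   # desired outputs (N = 2, k = 2)
Y = np.array([[0.8, 0.1], [0.2, 0.7]])   # actual outputs
print(online_criterion(D[0], Y[0]))      # 0.5*(0.2^2 + 0.1^2) = 0.025
print(batch_criterion(D, Y))             # (0.05 + 0.13)/4 = 0.045
```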
- For online learning (I(n) = 0.5 Σ_{i=1}^{k} e_i²(n)):
Parameter variation = learning rate (η) × local gradient (δ) × input (corresponding to the link).
- For batch learning (I(n) = (1/(2N)) Σ_{j=1}^{N} Σ_{i=1}^{k} e_i²(n, j)),
where e_i(n, j) = the error of the i-th output neuron for the j-th training sample presented at the n-th epoch:
Δw^l_{ik}(n) = (1/N) Σ_{j=1}^{N} Δw^l_{ik}(n, j) - the mean of the variations separately computed for each sample.
For the sake of simplicity, online learning is considered below: I(n) = 0.5 Σ_{i=1}^{k} e_i²(n), with
e_i(n) = d_i(n) - y^l_i(n) = the error produced by the output neuron i (l = index of the output layer).
For a neuron i of the output layer l:
∂I/∂w^l_{ij}(n) = [∂I/∂e_i(n)] · [∂e_i/∂y^l_i(n)] · [∂y^l_i/∂v^l_i(n)] · [∂v^l_i/∂w^l_{ij}(n)]
∂I/∂w^l_{ij}(n) = e_i(n) · (-1) · f'_i(v^l_i(n)) · y^{l-1}_j(n) = -δ^l_i(n) · y^{l-1}_j(n),
with
δ^l_i(n) = e_i(n) f'_i(v^l_i(n)) = -∂I/∂v^l_i(n) = the local gradient.
Parameter variation = learning rate (η) × local gradient (δ) × input (corresponding to the link).
The parameters will be adapted starting from the output layer towards the input layer (back-propagation).
Considering the hidden layer l, the local gradients within the layers l+1, l+2, etc. must be available from previous computations.
The output of the neuron i belonging to layer l is an input for the neurons belonging to layer l+1:
y^{l+1}_z(n) = f_{l+1}(v^{l+1}_z(n)), z = 1, k, with v^{l+1}_z(n) = Σ_{i=0}^{s} w^{l+1}_{z,i} y^l_i(n),
s = the number of input connections of the neuron z (the number of neurons within the previous layer, l), w^{l+1}_{z,0} = b^{l+1}_z, y^l_0 = 1;
δ^{l+1}_z is known for z = 1, k (already computed).
For the neuron i of layer l: v^l_i(n) = Σ_{j=0}^{q} w^l_{i,j} y^{l-1}_j(n),
q = the number of input connections of the neuron i (the number of neurons belonging to the previous layer), w^l_{i,0} = b^l_i, y^{l-1}_0 = 1.
∂I/∂w^l_{i,j}(n) = Σ_{z=1}^{k} [∂I/∂e^{l+1}_z(n)] · [∂e^{l+1}_z/∂y^{l+1}_z(n)] · [∂y^{l+1}_z/∂v^{l+1}_z(n)] · [∂v^{l+1}_z/∂y^l_i(n)] · [∂y^l_i/∂v^l_i(n)] · [∂v^l_i/∂w^l_{i,j}(n)]
∂I/∂w^l_{i,j}(n) = Σ_{z=1}^{k} e^{l+1}_z(n) · (-1) · f'_{l+1}(v^{l+1}_z(n)) · w^{l+1}_{z,i} · f'_l(v^l_i(n)) · y^{l-1}_j(n)
∂I/∂w^l_{i,j}(n) = -[Σ_{z=1}^{k} δ^{l+1}_z(n) w^{l+1}_{z,i}] · f'_l(v^l_i(n)) · y^{l-1}_j(n) = -δ^l_i(n) · y^{l-1}_j(n),
with
δ^l_i(n) = f'_l(v^l_i(n)) · Σ_{z=1}^{k} δ^{l+1}_z(n) w^{l+1}_{z,i}.
Hence Δw^l_{i,j}(n) = η δ^l_i(n) y^{l-1}_j(n):
Parameter variation = learning rate (η) × local gradient (δ) × input (corresponding to the link).
Remarks
o Sigmoid:
f(v) = 1 / (1 + exp(-a·v)), a > 0 ⇒ f'(v) = a·exp(-a·v) / [1 + exp(-a·v)]² = a·f(v)·[1 - f(v)]
o Hyperbolic tangent:
f(v) = a·tanh(b·v), a, b > 0 ⇒ f'(v) = (b/a)·[a - f(v)]·[a + f(v)]
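The sigmoid derivative identity can be checked numerically, and it feeds directly into the output-layer update Δw = η·δ·y^{l-1}; the sketch below uses an assumed a = 2 and arbitrary inputs:

```python
import numpy as np

# Check f'(v) = a f(v)[1 - f(v)] against a central finite difference,
# then apply the output-layer update for one neuron (illustrative values).
a = 2.0
f  = lambda v: 1.0 / (1.0 + np.exp(-a * v))
df = lambda v: a * f(v) * (1.0 - f(v))

v, h = 0.7, 1e-6
assert abs(df(v) - (f(v + h) - f(v - h)) / (2 * h)) < 1e-6

def output_update(d, v, y_prev, eta=0.1):
    e = d - f(v)                      # output error e = d - y
    delta = e * df(v)                 # local gradient
    return eta * delta * y_prev       # weight variations, one per input link

print(output_update(1.0, 0.2, np.array([0.5, -0.3])))
```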
For small η values: low convergence speed, but a quite smooth trajectory is followed within the search space.
Improvement (momentum): accumulate the past gradients with a forgetting factor α ∈ (0, 1):
Δw^l_{ij}(n) = -η Σ_{t=0}^{n} α^{n-t} ∂I/∂w^l_{ij}(t) = -η [α^n ∂I/∂w^l_{ij}(0) + α^{n-1} ∂I/∂w^l_{ij}(1) + ... + ∂I/∂w^l_{ij}(n)].
- when ∂I/∂w^l_{ij}(t) keeps its sign at successive iterations, the absolute value of Δw^l_{ij} increases (acceleration);
- when ∂I/∂w^l_{ij}(t) changes its sign at successive iterations, the absolute value of Δw^l_{ij} decreases (stabilization).
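The accumulation effect can be sketched with a recursive form of the rule above, Δw(n) = α·Δw(n-1) - η·∂I/∂w(n) (the constants are illustrative):

```python
# Momentum update: gradients of constant sign reinforce each other
# (growing steps); sign changes damp the step.
def momentum_step(prev_dw, grad, eta=0.1, alpha=0.9):
    return alpha * prev_dw - eta * grad

dw = 0.0
for _ in range(5):
    dw = momentum_step(dw, grad=1.0)   # constant-sign gradient
print(dw)   # the magnitude grows towards -eta/(1-alpha) = -1.0
```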
78
For online learning: training samples must be randomly presented to avoid cycling
Online learning
Convergence hardly to analyze (the examples must be randomly presented for avoiding the
stagnation in local optima)
5) Stop criteria
- only some recommendation can be made:
Recommendations:
o The norm of the gradient becomes close to 0
Disadvantage: numerous epochs can be involved.
o The variation of the criterion I becomes insignificant
Disadvantage: premature stop.
Outliers can impede the convergence and can lead to bad generalization capabilities.
Ex.: symmetric limiter.
Hyperbolic tangent: f(v) = a·tanh(b·v), with the recommended values (LeCun) a = 1.7159, b = 2/3, for which f'(0) ≈ 1.14.
8) Learning rate
Usually, the gradients in the output layer are bigger, so η should be smaller for the output neurons.
Theorem:
Any continuous bounded function can be approximated with any desired degree of accuracy ε > 0 by means of an MLP with a single hidden layer:
F(u) = Σ_{i=1}^{m} α_i f(Σ_{j=1}^{R} w_{ij} u_j + b_i), with
m = the number of hidden neurons,
R = the number of inputs.
Remark: the theorem does not give any indication concerning the resulting generalization capacity of the model, nor the time required for learning.
- the value of m:
o if m is small, the empirical risk is lower (reduced risk of learning the noise captured by the training samples);
o if m is large, a good accuracy can be obtained;
- when a single hidden layer is used:
o the parameters of the neurons tend to interact: the approximation of some samples can be improved solely by accepting a worse approximation of other samples.
For ANNs with 2 hidden layers: hidden layer 1 extracts the local properties; hidden layer 2 extracts the global properties.
Radial basis function (RBF) neurons:
[Figure: a radial basis neuron with inputs p_1, ..., p_R, centers c_1, ..., c_R and output y = f(n). See demorb1.]
y = f(||p - c||) = f(√((p_1 - c_1)² + ... + (p_R - c_R)²)), with
p = [p_1; ...; p_R], c = [c_1; ...; c_R],
c = center vector (a center for each input connection).
For a Gaussian activation function:
y = exp(-||p - c||² / (2σ²)) = exp(-[(p_1 - c_1)² + ... + (p_R - c_R)²] / (2σ²)),
with σ = the spread.
Remarks:
- The neuron is activated only if the input (vector) is similar to the center (vector).
o The accepted similitude level is given by the spread σ.
o If σ is large, the neuron is activated even for reduced similitude between the input and the center.
- For inputs which are very dissimilar to the center, the neuron is inactive:
y ≈ 0 for ||p - c|| >> 0 (p, c very different).
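The local action of the Gaussian neuron can be illustrated numerically (the center and spread values below are assumptions for the example):

```python
import numpy as np

# Gaussian RBF neuron: active only near its center; sigma sets the tolerance.
def rbf(p, c, sigma):
    return np.exp(-np.sum((p - c) ** 2) / (2 * sigma ** 2))

c = np.array([1.0, 1.0])
print(rbf(np.array([1.0, 1.0]), c, 0.5))   # 1.0 at the center
print(rbf(np.array([1.2, 0.9]), c, 0.5))   # similar input -> close to 1
print(rbf(np.array([5.0, 5.0]), c, 0.5))   # dissimilar input -> close to 0
print(rbf(np.array([5.0, 5.0]), c, 10.0))  # large sigma -> activated anyway
```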
RBF architecture
The standard architecture includes:
- a linear output neuron;
- a single hidden layer with s Gaussian neurons.
Because a single hidden layer is considered, the upper index is omitted for most of the notation (it is kept only for distinguishing between the linear and the radial basis activation functions).
[Figure: the standard RBF architecture. The input layer distributes u_1, ..., u_m; the hidden neuron i (i = 1, ..., s) has the centers c_{i1}, ..., c_{im}, the activation n_i and the output y_i = f_1(n_i); the linear output neuron has the weights w_1, ..., w_s, the bias b and the output y = f_2(n).]
y = f_2(w_1 y_1 + ... + w_s y_s + b) = w_1 y_1 + ... + w_s y_s + b = [w_1 ... w_s]·[y_1; ...; y_s] + b = Σ_{i=1}^{s} w_i y_i + b (the output neuron is linear, f_2(n) = n),
with u = [u_1; ...; u_m] = input vector, c_i = [c_{i1}; ...; c_{im}] = center vector for the hidden neuron i.
For Gaussian activation functions within the hidden layer:
y = b + Σ_{i=1}^{s} w_i exp(-||u - c_i||² / (2σ_i²)) = b + Σ_{i=1}^{s} w_i exp(-[(u_1 - c_{i1})² + ... + (u_m - c_{im})²] / (2σ_i²)),
with
c_i = center vector for the hidden neuron i,
σ_i = spread for the hidden neuron i.
Let us consider f: R^m → R^s, s large, with f(u) = [f_1(u); ...; f_s(u)], f_1, ..., f_s: R^m → R
(e.g. f_1, ..., f_s indicate the mappings provided by the s hidden neurons).
Definition
The classes C_1, C_2 are f-separable if there exists w = [w_1 .. w_s]^T ∈ R^s with w^T f(u) > 0 for any u ∈ C_1 and w^T f(u) < 0 for any u ∈ C_2.
Remarks:
- Ex.: f_1(u) = exp(-||u - [1; 1]||²) = exp(-(u_1 - 1)² - (u_2 - 1)²),
f_2(u) = exp(-||u - [0; 0]||²) = exp(-u_1² - u_2²).
[Figure: the samples in the (u_1, u_2) plane and their images in the (f_1(u), f_2(u)) plane.]
Exact interpolation:
Find F: R^m → R accepting the interpolation conditions for (u(i), d(i)), i = 1, N,
with u(i) = [u_1(i) .. u_m(i)]^T ∈ R^m and d(i) ∈ R,
d(i) = F(u(i)) = the desired output corresponding to the input u(i).
F(u) = Σ_{i=1}^{N} w_i f_i(||u - u(i)||), where
- the number of radial basis functions = the number of training samples;
- the functions f_i accept the centers c_i = u(i).
Usual radial basis functions:
a) f_i(u) = ||u - c_i||;
b) f_i(u) = 1 / (||u - c_i|| + q_i);
c) f_i(u) = exp(-||u - c_i||² / (2σ_i²)) : local, bounded.
The interpolation conditions lead to the linear system:
[f_1(u(1)) ... f_N(u(1)); ...; f_1(u(N)) ... f_N(u(N))] · [w_1; ...; w_N] = [d(1); ...; d(N)].
Let us consider Φ = [f_j(u(i))], i, j = 1, N = the interpolation matrix. Then:
Φ · [w_1; ...; w_N] = [d(1); ...; d(N)] ⇒ [w_1; ...; w_N] = Φ^{-1} · [d(1); ...; d(N)], if Φ is nonsingular.
Remarks:
o large N (many samples) ⇒ many radial basis functions ⇒ complex model (over-fitting);
o large N ⇒ risk of a poorly conditioned interpolation matrix and large execution times.
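Exact interpolation can be sketched on a small 1-D problem (the samples, spread and Gaussian choice below are illustrative):

```python
import numpy as np

# Exact RBF interpolation: one Gaussian per training sample (c_i = u(i)),
# then Phi w = d is solved for the output weights.
U = np.array([[0.0], [0.5], [1.0]])        # N = 3 training inputs u(i)
d = np.array([0.0, 1.0, 0.0])              # desired outputs d(i)
sigma = 0.4

Phi = np.exp(-((U - U.T) ** 2) / (2 * sigma ** 2))   # N x N interpolation matrix
w = np.linalg.solve(Phi, d)                # assumes Phi is nonsingular

def F(u):                                  # the interpolating model
    return np.exp(-((u - U.ravel()) ** 2) / (2 * sigma ** 2)) @ w

print(F(0.5))   # close to d(2) = 1: the training samples are reproduced
```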
Instead of the exact interpolation model F(u) = Σ_{i=1}^{N} w_i f_i(||u - u(i)||), consider
F(u) = b + Σ_{i=1}^{s} w_i f_i(||u - c_i||), with s < N:
the centers of the radial basis functions and the input training samples are different.
The training conditions lead to:
[f_1(u(1)) ... f_s(u(1)) 1; ...; f_1(u(N)) ... f_s(u(N)) 1] · [w_1; ...; w_s; b] = [d(1); ...; d(N)].
Let us denote:
G = [f_1(u(1)) ... f_s(u(1)) 1; ...; f_1(u(N)) ... f_s(u(N)) 1] ∈ R^{N×(s+1)}.
Therefore, it results:
G · [w_1; ...; w_s; b] = [d(1); ...; d(N)] ⇒ [w_1; ...; w_s; b] = G^+ · [d(1); ...; d(N)],
with G^+ = (G^T G)^{-1} G^T = the pseudo-inverse of G.
Ex. (XOR):
f_1(u) = exp(-||u - [1; 1]||²) = exp(-(u_1 - 1)² - (u_2 - 1)²),
f_2(u) = exp(-||u - [0; 0]||²) = exp(-u_1² - u_2²).
For the above mentioned input training samples (the four XOR corners u(1) = [1; 1], u(2) = [1; 0], u(3) = [0; 1], u(4) = [0; 0]) it results:
f_1(u(1)) = 1, f_2(u(1)) ≈ 0.13; f_1(u(2)) = f_2(u(2)) = f_1(u(3)) = f_2(u(3)) ≈ 0.36; f_1(u(4)) ≈ 0.13, f_2(u(4)) = 1.
Let us define: d(1) = 0, d(2) = 1, d(3) = 1, d(4) = 0.
Therefore, it results:
[w_1; w_2; b] = G^+ · [0; 1; 1; 0] = [-2.439; -2.439; 2.7561].
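The numbers above can be reproduced with the pseudo-inverse; note that the rounded activations 0.13 ≈ exp(-2) and 0.36 ≈ exp(-1) follow the values used in the text:

```python
import numpy as np

# XOR with two Gaussian hidden neurons (centers [1,1] and [0,0]) and a bias.
G = np.array([[1.00, 0.13, 1.0],   # u(1) = (1, 1)
              [0.36, 0.36, 1.0],   # u(2) = (1, 0)
              [0.36, 0.36, 1.0],   # u(3) = (0, 1)
              [0.13, 1.00, 1.0]])  # u(4) = (0, 0)
d = np.array([0.0, 1.0, 1.0, 0.0])

w = np.linalg.pinv(G) @ d          # [w1, w2, b]
print(np.round(w, 4))              # approximately [-2.439, -2.439, 2.7561]
```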
Theorem:
Any continuous bounded function F: R^m → R can be approximated with any desired degree of accuracy by means of
F(u) = b + Σ_{i=1}^{s} w_i f(||u - c_i|| / σ), σ > 0,
if the radial basis function f fulfils suitable conditions.
The requirements imposed by this theorem are met for the radial basis functions b), c).
In particular, one can use F(u) = b + Σ_{i=1}^{s} w_i exp(-||u - c_i||² / (2σ²)), where the same spread σ is employed for all the hidden neurons.
Recommendation: choose s = ∛N.
Remark:
- the ANN with hidden radial basis activation functions and a linear output neuron is compliant with the requirements of the previous theorem;
- if the centers are known, the output weights can be computed in a single step, by solving G · [w_1; ...; w_s; b] = [d(1); ...; d(N)] as above;
- generalization = interpolation between the training samples.
Comparison RBF vs. MLP:
- MLP: any number of hidden layers; the sigmoidal activation functions have a global action;
- RBF: a single hidden layer; the radial basis activation functions have a local action.
Learning strategies
1. Random centers selection
Step 1. Choose the centers randomly (uniformly distributed over the input range).
Step 2. Compute the spread σ = d_max / √(2s), with
d_max = the maximum distance between the selected centers,
s = the number of hidden neurons.
91
2. Centers self-organization
Step 1. The centers are chosen via clustering of training input samples (e. g. K-mean clustering)
K-mean clustering (tip: learn via competition):
Step 1-0: Chose random distinct initial values for all s centers, denoted c i (n) , with n = 0 and
i = 1, s .
Step 1-1: For the training sample u(n) , compute u(n) c i , i = 1, s and find the minimum
distance, which indicates the nearest center for this sample. Consider i , with i 1, s , the nearest
center.
Step 1-2: Update the nearest center, moving it towards the sample:
c i (n + 1) = c i (n) u(n) c i (n) , cu 1 > > 0
Step 1-3: n n + 1
Step 1-4: If some training samples have not been used yet, or the change made at step 1-2 is too
large,
Go to step 1-1.
d
Step 2. Compute the spread = max ,
2s
with
d max = the maximum distance between the selected centers,
s = the number of hidden neurons.
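The steps above can be sketched as a single competitive pass over the samples, followed by the spread rule (the data set, s and η are illustrative):

```python
import numpy as np

# Center self-organization (K-means style competition), then the spread rule.
rng = np.random.default_rng(1)
U = rng.uniform(0, 1, size=(200, 2))                 # training inputs
s, eta = 4, 0.1
C = U[rng.choice(len(U), s, replace=False)].copy()   # step 1-0: random centers

for u in U:                                          # steps 1-1 ... 1-4
    i_star = np.argmin(np.linalg.norm(u - C, axis=1))  # nearest center
    C[i_star] += eta * (u - C[i_star])               # move the winner

d_max = max(np.linalg.norm(a - b) for a in C for b in C)
sigma = d_max / np.sqrt(2 * s)                       # common spread
print(C, sigma)
```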
3. Supervised adaptation of all the parameters (gradient-based)
Criterion: I = (1/2) Σ_{j=1}^{N} e(j)², with e(j) = d(j) - Σ_{i=1}^{s} w_i exp(-||u(j) - c_i||² / (2σ_i²)).
At each iteration, the parameters of the RBF (weights, centers, spreads) are updated according to the following rules:
- for weights: w_i ← w_i - η_1 ∂I/∂w_i, with
∂I/∂w_i = -Σ_{j=1}^{N} e(j) exp(-||u(j) - c_i||² / (2σ_i²));
- for centers: c_i ← c_i - η_2 ∂I/∂c_i, with
∂I/∂(c_i)_k = -(w_i / σ_i²) Σ_{j=1}^{N} e(j) exp(-||u(j) - c_i||² / (2σ_i²)) · [u(j)_k - (c_i)_k];
- for spreads: σ_i ← σ_i - η_3 ∂I/∂σ_i, with
∂I/∂σ_i = -(w_i / σ_i³) Σ_{j=1}^{N} e(j) exp(-||u(j) - c_i||² / (2σ_i²)) · ||u(j) - c_i||².
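A minimal sketch of these update rules, on an illustrative 1-D fitting problem (the learning rates, data and use of mean gradients are assumptions, not taken from the text):

```python
import numpy as np

# One gradient step on I = 0.5 * sum_j e(j)^2 for a single-output RBF network,
# applying the three rules above (mean gradients keep the step size stable).
def phi(u, C, sig):
    return np.exp(-np.sum((u - C) ** 2, axis=1) / (2 * sig ** 2))

def train_step(U, d, w, C, sig, etas=(0.2, 0.02, 0.02)):
    gw, gC, gs = np.zeros_like(w), np.zeros_like(C), np.zeros_like(sig)
    for u, dj in zip(U, d):
        p = phi(u, C, sig)
        e = dj - w @ p
        gw += e * p                                               # -dI/dw_i
        gC += (e * w * p / sig ** 2)[:, None] * (u - C)           # -dI/dc_i
        gs += e * w * p * np.sum((u - C) ** 2, axis=1) / sig ** 3 # -dI/dsigma_i
    n = len(U)
    return w + etas[0]*gw/n, C + etas[1]*gC/n, sig + etas[2]*gs/n

rng = np.random.default_rng(3)
U = rng.uniform(-1, 1, (40, 1)); d = np.sin(3 * U).ravel()
w, C, sig = np.zeros(5), np.linspace(-1, 1, 5)[:, None], np.full(5, 0.4)
mse = lambda: np.mean([(dj - w @ phi(u, C, sig)) ** 2 for u, dj in zip(U, d)])
m0 = mse()
for _ in range(200):
    w, C, sig = train_step(U, d, w, C, sig)
m_final = mse()
print(m0, m_final)
```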
4. Constructive algorithm
- the hidden neurons are inserted one at a time; the center vector of each new neuron copies the input training sample that produces the highest squared output error for the current architecture.
→ see MATLAB.
ANN:
- robustness;
- capacity of inductive learning (supervised or unsupervised);
- high computational capacity;
- parallelism.
+
GA:
- robustness, flexibility;
- scarce a priori information required concerning the objective function.
Types of combinations:
- supportive: reduced cooperation between the GA and the ANN; the methods are applied sequentially and separately, considering two distinct subproblems, or are used independently for solving the same problem.
- collaborative: strong cooperation between the GA and the ANN; these combinations exploit the merits of the involved techniques more advantageously.
C1. GA used for preparing the input data for the neural classifiers:
Feature (input) selection:
Aim: improve the recognition rate and the execution times via the selection of a few relevant features.
Assuming binary encoding, a locus can indicate the use/absence of a feature.
The drawbacks result from the fact that the method involves large computational times, as the evaluation of each chromosome demands training the corresponding classifier.
o Chang & Lippmann obtained an 80% reduction of the features in a voice recognition problem.
o Guo & Uhrig designed a diagnosis system based on neural observers for a nuclear plant. The GA decided which are the inputs of each observer (given a large set of thousands of available variables).
Aim: the ANN should have few inputs and should be precise, so the objective function can be defined as follows:
f(x) = (e^{0.7(z-1)³/(t+1)}) · (1 - e^{-0.01·err})^{0.15(t+1)}, where
x denotes the chromosome which has to be evaluated,
z = (no. of variables) / (no. of selected variables).
Assuming binary encoding, 1/0 indicates the use/the absence of the corresponding plant variable.
A similar problem was solved by Weller.
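A feature-selection GA of this kind can be sketched as below; the mock `error` function and the set of "relevant" features are hypothetical stand-ins for actually training and evaluating a classifier on each subset (the expensive step noted above), and the fitness is a simplified surrogate for the objective function:

```python
import random

# Binary chromosome: bit i marks whether input feature i is used.
m = 8
relevant = {0, 2, 5}                   # hypothetical informative features

def error(mask):                       # mock classifier error (stand-in)
    return sum(1 for i in relevant if not mask[i])

def fitness(mask):                     # reward few inputs and low error
    return m / max(sum(mask), 1) - 10 * error(mask)

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(m)] for _ in range(20)]
init_best = fitness(max(pop, key=fitness))
for _ in range(30):                    # elitist selection + bit-flip mutation
    pop.sort(key=fitness, reverse=True)
    pop = pop[:10] + [[b ^ (random.random() < 0.1) for b in p] for p in pop[:10]]
best = max(pop, key=fitness)
print(best, fitness(best))
```

Elitism (keeping the top half unchanged) guarantees that the best fitness never degrades across generations.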
f(x) = ||y - y_d||, where
x denotes a chromosome,
y represents the neural output corresponding to x,
y_d indicates the target output (minimum, maximum, threshold).
o GA used for
training the ANN
and/or
selecting the ANN topology
>> better accuracy and better generalization capabilities.
A.
- Unlike gradient-based training, genetic learning is robust and reduces the risk of stagnation in local optima.
- Genetic training can also be used for ANNs with non-differentiable activation functions or with recurrent connections.
- Genetic training involves large computational times; however, a better convergence speed can be achieved via hybridization with local optimizations.
- Usually, the objective function is the mean output error computed for the whole training data set.
- Competing conventions: several distinct chromosomes (e.g. differing by a permutation of the hidden neurons) can encode equivalent networks, which makes crossover less effective.
step 1: Initialize the minimal neural topology (2 layers: input and output).
step 2: Adapt the weights, for N_ep epochs, by means of genetic learning.
step 3: Test if the ANN accuracy is convenient, E < E0 (E = output squared error):
Yes → go to step 8. No → continue with step 4.
step 4: Insert a new hidden neuron, denoted N.
C1 = the set of N's input connections (coming from the neural inputs and the other hidden neurons),
C2 = the set of N's output connections.
Initialize the weights of these links with random values close to 0.
step 5: Adapt the weights of C1 and the bias of N, by maximizing the covariance between the outputs of the hidden neurons and the squared output error; the genetic algorithm is applied for N_ep_1 epochs (let C denote the best objective value).
step 6: Adapt the weights of C2, by minimizing the output squared error; the genetic procedure is applied for N_ep_2 epochs.
step 7: Go to step 3.
step 8: Stop.
Advantages:
- the parameters of a single neuron are trained at each stage;
- the algorithm constructs the neural topology too (without genetic techniques);
- the hybridization CC-GA allows the selection of simpler topologies, at the cost of an increased computational time.
Improvements of CC
o Potter:
The weights of C2 are found by selecting the additive values belonging to (0, -C), which correct the best adapted individuals obtained at the preceding neuron insertion.
f(x) = err + g^T g, where
err denotes the output squared error,
g is the vector of weights.
o Topalov:
- multiple sequences of training, each one consisting of genetic training followed by back-propagation; the maximum number of commutations (sequences) is preset.
o Ng:
- apply back-propagation; if the output squared error is too big and its variation during the previous epochs is insignificant, then a GA is used to guide the search far away from the local optimum.
The GA aims at the minimization of the output squared error and uses Gaussian mutation.
A large threshold permits overly long stagnations, whilst a small one can generate false alarms.
o Ku:
- train recurrent neural networks by means of diffusion (cellular) genetic algorithms.
The chromosomes are organized according to a matrix-based topology.
Crossover acts between neighbors only.
B.
Better understanding of the neural representation → find ways for modeling symbolic knowledge.
Expected noise:
- the performance of the ANN is influenced by the initial random values of the neural parameters;
- the performance of the ANN is influenced by the training algorithm; to eliminate this drawback, the GA can also work on the neural parameters.
Correctness:
control the impact of the genetic operators (e.g. acting on the links, on specific inner structures, etc.).
Examples
- Braun & Zagorski: consider a term describing the complexity of the encoded ANN within the objective function. Improved genetic operators: e.g. the neurons which are deleted are stored for further potential insertions.
- Dasgupta, Mann: hierarchical encoding; the higher levels include control genes, the leaves correspond to parametric genes. Changes performed within the upper levels correspond to significant alterations of the encoded neural architecture; changes performed within the lower levels correspond to less significant alterations.
- Thierens: use a canonical representation which eliminates the effects produced by the symmetries of the activation functions, the permutations of neurons/links, etc. To this end, several transformations are made: negative biases are changed to positive ones, and the sign of all the corresponding incoming weights is also changed; then, the neurons are sorted in terms of bias.
- Sato & Nagaya, Sato & Ochiai: use matrix-based encoding for evolving the neural architecture and the neural parameters, for ANNs with binary weights.
- Romaniuk: genetic CC for selecting the architecture of neural classifiers.
o Recurrent ANNs with sigmoidal activation functions are considered.
o The algorithm starts with a simple structure and adds new structures which are genetically configured. The blocks which were already inserted remain unchanged at the next steps.
o Additionally, the resulting topologies are simplified by deleting the less important links.
o The significance of a link is established in the following manner: each connection is deleted in turn, the response of the ANN is evaluated for all the input training samples, and the newly resulting faults are counted. Small counters indicate insignificant links.
- Liu & Yao: select the architecture and the parameters of generalized neural networks.
o These ANNs include both sigmoidal and Gaussian neurons.
Parametric encoding:
Encode the parameters which describe the architecture, such as: the number of layers, the number of neurons within a layer, the type of accepted connections. Usually, this encoding refers to a limited number of possible topologies.
o The algorithm was used for ANNs with binary and float weights. Good results were obtained for symmetric ANNs.
o The neural architecture is built via a cellular division process. The algorithm starts with a root cell. Each cell possesses internal registers for storing the weights and the bias. The proposed language indicates the following actions: cellular division, the transformation of a cell into a neuron, choosing the values of the neural parameters, including delays, including recurrence.
A chromosome is evaluated by considering the first k members of the family, for which the topology p + 1 gives better results than the topology p, p = 1, k-1, and the topology k + 1 is worse than the topology k. The objective function is equal to the sum of the output squared errors corresponding to the first k members.
o The genetic operators act solely on the first member of the family. Both crossover and mutation can be used. Higher probabilities are assigned for changing a symbol to recurrence or vice versa. The individuals can be improved by Lamarckian local optimization.
A.
Yao & Liu produce the delivered neural architecture by combining the genetic material of the individuals included in the last population.
o Various types of combinations were suggested.
o The resulting ANN has better performance, at higher computational costs.
B.
Smith & Cribbs: ANN with binary weights and hard limiter activation functions.
- the population includes structural blocks which need to be aggregated in order to build the ANN (a chromosome encodes a structural block);
- when the population contains multiple copies of an individual, a single copy is used within the neural structure.
[Figure: the output layer weights are set without the GA; the hidden blocks, encoded by the chromosomes NNCrom1, NNCrom2, ..., NNCromN, are improved with the GA.]
- If the response is correct, each of the n2 chromosomes having the output 1 is awarded the fitness 1/n2.
- If the response is incorrect, then all n3 chromosomes having the output 1 are awarded the fitness -1/n3; all the chromosomes having the output 0 cannot participate in the error correction, therefore their fitness is not changed.
High fitness is assigned to a structural block which is useful for many samples, or which is the main contributor for specific samples.
REFERENCES
Affenzeller, M., Winkler, S., Wagner, S., Beham, A. (2009). Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Application. Boca Raton, FL: CRC Press, 157-207.
Angeline, P. J., Saunders, G. M., Pollack, J. B. (1994). An Evolutionary Algorithm that Constructs Recurrent Neural Networks, IEEE Transactions on Neural Networks, 5 (1), 54-65.
Ashlock, D. (2006). Evolutionary Computation for Modeling and Optimization, Springer, New York.
Baluja, S. (1996). Evolution of an Artificial Neural Network Based Autonomous Land Vehicle Controller, IEEE Transactions on Systems, Man and Cybernetics - Part B, 26 (3), 450-463.
Bäck, T., Fogel, D., Michalewicz, Z. (2000). Evolutionary Computation 2. Advanced Algorithms and Operators, Institute of Physics Publishing, USA.
Barton, A. J., Valdés, J. J., Orchard, R. (2009). Neural networks with multiple general neuron models: A hybrid computational intelligence approach using Genetic Programming, Neural Networks, 22, 614-622.
Bengio, S., Bengio, Y., Cloutier, J. (1994). Use of Genetic Programming for the Search of a New Learning Rule for Neural Networks, Proc. of Conference on Evolutionary Computation, USA, 324-327.
Benuskova, L., Kasabov, N. (2007). Computational Neurogenetic Modeling, Springer.
Bonarini, A., Masulli, F., Pasi, G. (2003). Soft Computing Applications (Advances in Soft Computing), Physica-Verlag, Heidelberg.
Braun, H., Zagorski, P. (1994). ENZO II - a Powerful Design Tool to Evolve Multilayer Feedforward Networks, Proc. of Conference on Evolutionary Computation, USA, 278-283.
Coello Coello, C. A., Lamont, G. B., Van Veldhuizen, D. A. (2007). Evolutionary Algorithms for Solving Multi-Objective Problems, 2nd Edition. New York, NY: Springer, 50-150.
Da Ruan (1997). Intelligent Hybrid Systems, Kluwer Academic Publishers, USA.
De Jong, K. A. (2006). Evolutionary Computation - A Unified Approach. Cambridge, MA: MIT Press.
DiMattina, C. (2010). How to Modify a Neural Network Gradually Without Changing Its Input-Output Functionality, Neural Computation, 22, 1-47.
Dumitrache, I., Buiu, C. (1995). Introduction to Genetic Algorithms, Ed. Politehnica, Bucuresti, Romania.
Ferariu, L. (2005). Algoritmi evolutivi in identificarea si conducerea sistemelor, Politehnium, Iasi, Romania.
Ferariu, L. (2010). Sisteme neurogenetice, Politehnium, Iasi, Romania.