
On the Complexity of Scaling Particle Swarm Algorithm

Kingsley E Osime

September 23, 2011

Abstract

For the scaling of particle swarm to be considered as one of linear complexity, the average number of evaluations would be expected to increase only linearly as dimensionality increases, while maintaining a high success rate. This research shows that while particle swarm scales linearly on problems of lower dimensional complexity, it is less effective on problems of higher dimensional complexity.
Contents

0.1 Introduction
  0.1.1 Global Optimisation
  0.1.2 Components of an Evolutionary Algorithm
0.2 Scaling Evolutionary Algorithms
  0.2.1 Genetic Algorithms
  0.2.2 Basic Genetic Operations
  0.2.3 Parallelisation
  0.2.4 Evolutionary Programming
  0.2.5 Operators in Evolutionary Programming
  0.2.6 Scaling Evolutionary Programming
0.3 The Particle Swarm
  0.3.1 Sociocognitive Foundations
  0.3.2 Continuous PSO
  0.3.3 Binary PSO
  0.3.4 Neighborhood Topologies
  0.3.5 Particle Swarm Trajectories
  0.3.6 Scalability
0.4 Experimentation
  0.4.1 Method
  0.4.2 Results
  0.4.3 Conclusion
List of Figures
List of Tables
0.1 Introduction
Evolutionary Algorithms (EAs) are search and optimisation methods inspired by the process of evolution and natural selection. The power of evolution in nature is evident in the diverse species that make up our world, each tailored to survive well within its own niche. An explanation of biological diversity and its underlying mechanism can be found in Darwin's theory of evolution. Natural selection, or survival of the fittest, plays a central role in what is sometimes called the macroscopic view of evolution, while the discipline of molecular genetics offers a microscopic view of natural evolution by shedding light on the processes below the level of visible phenotypic features. In genetics, a phenotype is an organism's observable characteristic features or traits, as opposed to a genotype, which is the genetic make-up of an organism or individual, usually with reference to a specific character or observable feature of the said individual. One may describe the fundamental observation from genetics as each individual consisting of a dual entity: its phenotypic (outside) properties are represented at a lower genotypic (inside) level; put another way, an individual's genotype encodes its phenotype.
The application of Darwinian principles to problem solving dates back to the late 1940s, when Turing first proposed "genetical or evolutionary search"; by 1962, Bremermann had implemented computer experiments on optimisation through evolution and recombination [10]. Since then, many different implementations of the basic idea have emerged. In the 1960s, evolutionary programming was proposed by Fogel, Owens and Walsh, genetic algorithms were developed by John Holland, and Rechenberg and Schwefel developed evolution strategies. By the early 1990s, a fourth stream called genetic programming [23] was pioneered by Koza, and by 1995 Kennedy and Eberhart had developed particle swarm optimisation [9]. These problem solvers belong to a class of population-based metaheuristics that have been applied successfully to many combinatorial problems in the not too distant past.
Scaling EAs to large-scale optimisation problems has historically attracted much interest in both theoretical and experimental studies, because such problems appear in many real-world settings. The earliest practical approaches to scaling an EA involved parallelising an existing genetic algorithm [26]. Parallelisation is a divide-and-conquer approach that can be applied to genetic algorithms in a number of different ways, for instance using a distributed software model [1]. These days, researchers are limited to using test problems in individual studies in the absence of a systematic evaluation platform [5]. However, most analytical and experimental results obtained with this methodology concern low-dimensional problems, i.e. between 1 and 30 dimensions [6][28][34][15][20][8]. Unfortunately, attempts to scale EAs to larger dimensions, i.e. from 50 up to 100-1000 dimensions, have met with mixed results, though scientists have recorded some success using the cooperative coevolution [30] approach to scaling an EA [39].
EAs are said to lose their efficacy when applied to large and complex problems; in other words, their performance suffers from what is known as the curse of dimensionality, which is to say that it deteriorates quickly as dimensionality increases. The reasons for this appear to be two-fold. Firstly, the complexity of the problem increases with the size of the problem; secondly, the solution space of the problem increases exponentially with the problem size, requiring a more efficient search strategy to explore all promising regions within a given time budget.
The particle swarm algorithm takes its inspiration from a simple sociocognitive theory [17] and is based on the simulation of a simplified social model [18]. It has been successfully applied to both real-valued [36][16][33] and binary [19] optimisation problems. The performance of the particle swarm algorithm is said to be affected by its neighborhood topology [21][11][20]. Researchers have conducted detailed analytical studies of its trajectory, which is said to oscillate around the mean of a particle's previous best position and the best position of some other particle [28][6].
The available literature on the scalability of the particle swarm algorithm is not extensive; however, particle swarm has been found to have serious issues on larger and more complex problems [27]. The aim of this project is to ascertain whether the scaling of the particle swarm algorithm is a problem of linear complexity. The particle swarm algorithm is tested on two benchmark functions [5] (one unimodal and one multimodal) with increasing dimensionality and a fixed population, and the relationship between the average best final value and the average number of function evaluations required is analysed. The remainder of this chapter explores general themes around metaheuristics and problem solving. In Chapter 2, a discussion of previous attempts at scaling EAs is undertaken. Chapter 3 discusses the particle swarm in detail, and Chapter 4 is reserved for discussing the experiments undertaken.
0.1.1 Global Optimisation
Optimisation is the process of making something better. It consists primarily in trying variations on an initial concept and using the information gained to improve the idea. To be more precise, optimisation is the process of adjusting the inputs to, or characteristics of, a device, mathematical process, or experiment to find the minimum or maximum output. The input consists of variables; the process or function is known as the cost function, objective function or fitness function; and the output is the cost or fitness. To put this in context, an analogy can be made to real life, where we are confronted daily with numerous opportunities for optimisation: for instance, deciding the best route to work, or what time to get up in the morning so that we maximise the amount of sleep yet make it to work on time. Optimisation can thus be viewed as a decision problem that involves finding the best solution. The term best (or global optimum) solution suggests there may be more than one solution, and that the solutions may be of unequal value. The definition of best is relative to the problem at hand, its method, and the individual formulating the problem.
In pure mathematical terms, a general optimisation problem can be represented in the following way [4]:

Definition 1 - A general single-objective optimisation problem is defined as minimising (or maximising) f(x) subject to g_i(x) \le 0, i = 1, ..., m, and h_j(x) = 0, j = 1, ..., p, for x \in \Omega. A solution minimises (or maximises) the scalar f(x), where x is an n-dimensional decision variable vector x = (x_1, ..., x_n) from some universe \Omega.

Note that g_i(x) and h_j(x) are constraints that must be satisfied while optimising (minimising or maximising) f(x), and \Omega contains all possible x that can be used to satisfy the evaluation of f(x) and its constraints. Further, x can be a vector of continuous or discrete variables, and the function f can likewise be of the discrete or continuous kind. Consequently, one may define the global minimum of a single-objective problem as follows:
Definition 2 - A general single-objective global minimum optimisation problem: given a function f : \Omega \subseteq \mathbb{R}^n \to \mathbb{R}, \Omega \neq \emptyset, for x \in \Omega the value f^* = f(x^*) > -\infty is called a global minimum if and only if

\forall x \in \Omega : f(x^*) \le f(x)

Here x^* is by definition the global minimum solution, f is the objective function, and the set \Omega is the feasible region of x. The astute reader may notice the judicious use of the term single-objective optimisation. This is in deference to the depth and the varying levels of complexity in the field of optimisation. It is important to recognise the distinction that though single-objective optimisation problems may have a unique optimal solution, there exists a separate class of problems where one finds, as a rule, a possibly uncountable set of solutions.
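To make Definitions 1 and 2 concrete, the short sketch below evaluates a candidate solution against an objective and its constraints. It is purely illustrative: the sphere objective, the single inequality constraint, and the bounds standing in for \Omega are assumptions made for this example, not functions used elsewhere in this project.

    def f(x):
        # Assumed example objective: the sphere function, minimised at x = (0, ..., 0)
        return sum(xi ** 2 for xi in x)

    def feasible(x):
        # Assumed constraints: one inequality g(x) <= 0 plus box bounds defining Omega
        g = 1.0 - sum(x)                          # g(x) = 1 - (x_1 + ... + x_n) <= 0
        in_omega = all(-5.0 <= xi <= 5.0 for xi in x)
        return g <= 0.0 and in_omega

    x = [0.6, 0.7]
    if feasible(x):
        print("f(x) =", f(x))                     # value an optimiser would compare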
An oversimplified classification of optimisation algorithms into six categories, which cannot all be said to be mutually exclusive, is given in [13]:

Function vs. trial and error - Trial-and-error optimisation is an approach preferred by experimentalists. It refers to the process of adjusting variables that affect the output without knowing much about the process that produces the output. In contrast, in function optimisation a mathematical formula describes the objective function, and manipulations of the function lead to the optimal solution. This approach is mostly preferred by theoreticians.

Single variable vs. multiple variable - A single variable means the optimisation is one-dimensional, as opposed to a problem having multiple variables requiring multidimensional optimisation. Optimisation is known to become increasingly difficult as the number of dimensions increases.

Static vs. dynamic - Dynamic optimisation treats the output as a function of time, in contrast to static optimisation, which is exactly the opposite; this is to say it is independent of time.

Discrete vs. continuous - Discrete variables have only a finite number of possible values, in contrast to continuous variables, which have an infinite number of possible values. Discrete optimisation is also known as combinatorial optimisation, because the optimum solution consists of a certain combination of variables from the finite pool of all possible values.

Constrained vs. unconstrained - Constrained optimisation incorporates variable equalities and inequalities into the cost function, while unconstrained optimisation allows variables to take on any value. When constrained optimisation formulates the problem in terms of linear variables and linear constraints, the problem is called a linear program; the opposite is the case when the cost equations or constraints are nonlinear.

Random vs. minimum seeking - Minimum-seeking algorithms are traditional optimisation algorithms usually based on calculus methods. They tend to be fast but usually get stuck in local minima. On the other hand, random methods use probabilistic calculations to find variable sets. They tend to be slower but have greater success at finding the global minimum.
Evolutionary algorithms belong to the class of random, population-based metaheuristics. They are said to be global in nature in the sense that they attempt to search throughout the entire feasible set. An evolutionary algorithm requires both an objective and a fitness function, which play fundamentally different roles. The objective function defines an EA's optimality condition (and is a feature of the problem domain), while the fitness function (in the algorithm domain) measures how well a particular solution satisfies that condition and assigns a corresponding real value to that solution [7]. In practice, however, the two functions are often identical.
0.1.2 Components of an Evolutionary Algorithm
The history of the field of evolutionary algorithms is replete with the development of many variants. The unifying idea behind most of these variants is, however, the same: given a population of individuals, environmental pressure instigates natural selection (survival of the fittest), which in turn causes a rise in the fitness of the population. Given a function to be optimised, we randomly create a set of candidate solutions, that is, elements of the function's domain, and apply the quality function as an abstract fitness measure. On the basis of this fitness, the better candidates are chosen to seed the next generation by applying recombination and/or mutation. Applying recombination and mutation leads to a new set of candidates (offspring) that compete, based on their fitness (and possibly age), with the older candidates for a place in the next generation. This process is iterated until a candidate with sufficient quality (a solution) is found. EAs have a number of components, procedures or operators that must be specified in order to define a particular EA [10], though not all of them are applicable to every variant:
Representation (Definition of Individuals)
The first step in designing an EA is deciding how to represent candidate solutions; this involves linking the real world to the EA world. Objects representing possible solutions within the original problem context are referred to as phenotypes, while their encodings, i.e. the individuals within the algorithm, are called genotypes. Representation amounts to specifying a mapping from the phenotypes onto a set of genotypes that are described as representing the phenotypes. For example, one could represent a problem over integers (phenotypes) by their binary code (genotypes). The evolutionary search itself takes place in the genotype space, and a good solution is obtained by performing an inverse mapping from genotype to phenotype in a process referred to as decoding. To each genotype there must correspond at most one phenotype.
Evaluation (Fitness Function)
The evaluation function forms the basis of selection and therefore facilitates improvement. From the problem perspective, it represents the task to be solved in an evolutionary context. Technically, it is a function or procedure that assigns a quality measure to genotypes. Typically, an evaluation function is composed from a quality measure in phenotype space together with the inverse representation (decoding). The evaluation function is usually called the fitness function in EC terminology, and the objective function in the original problem context.
Population
Given a representation, defining a population can be as simple as specifying how many individuals are in it, that is, setting the population size. A population is a multiset of genotypes and forms the unit of evolution. In some EAs, a population may have additional spatial structure with a distance measure or neighborhood relation. The diversity of a population is a measure of the number of different solutions present within it. There is no single measure of diversity; however, it may be convenient to refer to diversity in terms of the number of different fitness values present, or the number of different phenotypes or genotypes present. In most EA applications, the population size is a constant and does not change during the evolutionary search.
Recombination
Recombination, or crossover, is a variation operator that merges information from two parent genotypes into one or two offspring. It is a stochastic operator in deciding which parts of each parent to combine and the way in which to combine those parts. The principle behind recombination is simple: by mating two individuals with different but desirable features, an offspring that combines both of those features can be produced. Recombination operators in EAs are usually applied probabilistically, that is, with a certain chance of not being performed. Recombination is never used in particle swarm optimisation or evolutionary programming; however, it is the main search operator in genetic algorithms and often the only variation operator in genetic programming.
Mutation
A mutation operator is a stochastic unary variation operator that is applied to one genotype and delivers a (slightly) modified mutant, the child or offspring. In general, mutation is supposed to cause a random, unbiased change, and its role and implementation vary according to the EA variant; for instance, it is rarely used in genetic programming, while in genetic algorithms it is traditionally seen as a background operator to fill the gene pool with fresh blood, in evolutionary programming it is the one and only variation operator, and in particle swarm it does not exist at all.
Initialisation
The initialisation of an evolutionary algorithm can be completely random, or can incorporate human or other expertise about solutions that may work better than others. Incorporating such expertise about the problem or problem domain amounts to using a priori knowledge with the aim of generating an initial population with a higher fitness.
Termination
If a problem has a known optimal fitness level, then reaching this level or value can be used as a stopping condition. However, EAs are stochastic in nature and there is no guarantee of reaching this optimum; hence the condition might never be satisfied and, consequently, the algorithm might never stop. This requires the condition to be extended with one that is guaranteed to stop the algorithm. Some of the most commonly used conditions include, for instance, the total number of function evaluations or the total number of iterations reaching a certain limit.
Each of these components must be specified in order to define a particular EA. Furthermore, to obtain a running EA, an initialisation procedure and a termination condition must also be specified.
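Taken together, these components suggest the generate-and-test loop that most EAs share. The skeleton below is a minimal sketch assembled from the components above; the operator arguments are placeholders to be supplied by a particular EA variant, not any published implementation, and costs are assumed to be minimised.

    def evolve(init, fitness, select, recombine, mutate, done, pop_size=50):
        # Initialisation: create the starting population
        population = [init() for _ in range(pop_size)]
        generation = 0
        while not done(population, generation):            # termination condition
            parents = select(population, fitness)          # selection by fitness
            offspring = []
            for i in range(0, len(parents) - 1, 2):
                a, b = recombine(parents[i], parents[i + 1])   # recombination
                offspring += [mutate(a), mutate(b)]            # mutation
            # Replacement: offspring compete with the old population for survival
            population = sorted(population + offspring, key=fitness)[:pop_size]
            generation += 1
        return min(population, key=fitness)                # best (lowest-cost) solution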
0.2 Scaling Evolutionary Algorithms
Scaling EAs is a worthwhile endeavour that has occupied the best minds from various scientific fields, as a result of the immense benefits on offer. This chapter discusses two different variants of evolutionary algorithms and looks at the methods historically employed to scale these algorithms to handle larger and more complex problems.
0.2.1 Genetic Algorithms
Genetic algorithms (GAs) are search methods based on principles of natural selection and genetics [23]. They begin like any other optimisation algorithm, by defining the variables to be optimised, the cost function and the cost, and they end like other optimisation algorithms, by testing for convergence. In between, however, the GA is quite different from other optimisation algorithms. Genetic algorithms encode the decision variables of a search problem into finite-length strings over an alphabet of a certain cardinality. Without loss of generality, strings are considered to be constructed over a binary alphabet, for instance V = {0, 1}. In [12], Goldberg explains how one may motivate the notion of a schema by appending a special wildcard symbol to this alphabet, yielding a ternary alphabet. A schema is a similarity template describing a subset of strings with similarities at certain string positions. In general, for alphabets of cardinality (number of alphabet characters) k, there are (k + 1)^l schemata, where l is the string length.
The strings which are considered candidate solutions to the problem are referred to as chromosomes. For example, the seven-bit string A = 1001110 may be represented symbolically as follows:

A = a_1 a_2 a_3 a_4 a_5 a_6 a_7

Each position a_i is said to represent a single binary feature or detector; collectively the a_i are referred to as genes, and the values of these genes are called alleles, which may take on a value of 1 or 0.
GAs work with a coding of the parameters, rather than directly with the parameters themselves. An important concept in GAs, as with any evolutionary algorithm, is the notion of population. Genetic algorithms directly manipulate a population of strings in a straightforward manner, and this explicit processing of strings causes the implicit processing of many schemata during each generation. The size of the population is a user-specified parameter and is one of the important factors affecting the scalability and performance of genetic algorithms. For instance, a small population size might lead to premature convergence and yield substandard solutions, while a large population size might lead to an unnecessarily profligate expenditure of valuable computational time and resources. A simple genetic algorithm that yields good results in many practical applications is composed of three operators: reproduction, crossover and mutation.
The GA begins by defining a chromosome, an array of variable values to be optimised. For instance, if the chromosome has N variables (an N-dimensional optimisation problem) given by p_1, p_2, ..., p_N, then the chromosome can be expressed notationally as an N-element row vector:

chromosome = [p_1, p_2, p_3, ..., p_N]
To evolve good solutions and implement natural selection, a good measure for distinguishing good solutions from bad solutions is required. The measure could be an objective function or cost function, which may be a mathematical model, a game, or even an experiment. A cost function generates output from a set of input variables. The term fitness is extensively used in the GA literature to designate the output of the objective function. In essence, the fitness measure must determine a candidate solution's relative fitness, which will subsequently be used by the GA to guide the evolution of good solutions. For instance, searching for the maximum elevation on a topographical map might require a cost function with input variables longitude (x) and latitude (y), defined thus: chromosome = [x, y], where N = 2. A cost is found for each chromosome by evaluating the cost function f at (p_1, p_2, ..., p_N):

cost = f(chromosome) = f(p_1, p_2, ..., p_N)

The cost function is written as the negative of the elevation in order to put it into the form of a minimisation problem [13]:

f(x, y) = -(elevation at (x, y))
The following steps describe how GAs evolve solutions to a search problem [23] (a minimal sketch tying the steps together follows this list):

1. Initialisation. The initial population of candidate solutions is usually generated randomly across the search space. Domain-specific knowledge or other such information may be incorporated to help the search.

2. Evaluation. After initialisation, the fitness values of the candidate solutions are evaluated.

3. Selection. Selection allocates more copies to those solutions with higher fitness values and thus maintains a survival-of-the-fittest mechanism. The whole point of selection is to prefer better solutions over worse ones. There are many selection procedures in the literature, notably roulette-wheel selection, ranking selection and tournament selection. A brief discussion of selection can be found in the next section.

4. Recombination. Recombination (also called crossover) combines parts of two or more parental solutions to create new, possibly better solutions (offspring). Under recombination, an offspring will not be identical to any particular parent, instead combining parental traits in a novel way.

5. Mutation. Mutation locally but randomly modifies a solution. There are many variations of mutation, but it usually involves one or more changes being made to an individual's trait or traits.

6. Replacement. The offspring population created by selection, recombination, and mutation replaces the original parental population. Replacement techniques used in GAs include elitist replacement and steady-state replacement.
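The sketch below strings steps 1-6 together for bitstring chromosomes. The parameter values and the count-the-ones fitness are illustrative assumptions, not settings drawn from this project's experiments.

    import random

    BITS, POP, PC, PM = 20, 30, 0.9, 0.02        # assumed illustrative parameters

    def fitness(c):                              # assumed objective: number of ones
        return sum(c)

    def tournament(pop, s=2):                    # step 3: tournament selection
        return max(random.sample(pop, s), key=fitness)

    pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]  # step 1
    for gen in range(100):                       # steps 2-6, repeated
        nxt = []
        while len(nxt) < POP:
            p1, p2 = tournament(pop), tournament(pop)
            if random.random() < PC:             # step 4: one-point crossover
                cut = random.randint(1, BITS - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for c in (c1, c2):                   # step 5: bit-flip mutation
                nxt.append([b ^ 1 if random.random() < PM else b for b in c])
        pop = nxt[:POP]                          # step 6: generational replacement
    print(max(fitness(c) for c in pop))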
0.2.2 Basic Genetic Operations
In this section, some of the selection, recombination and mutation operators commonly used in genetic algorithms are described in a bit more detail.
Selection
Selection is a very important concept in genetic algorithms. It is modeled after Darwin's theory of evolution: survival of the fittest translates to discarding the chromosomes with the highest cost, so that only the best are selected to continue while the rest are deleted. Two chromosomes are selected from the mating pool of N_keep chromosomes to produce two new offspring, and this pairing continues in the mating population until enough offspring are born to replace the discarded chromosomes. There are a number of methods in the literature used to perform selection in GAs. In [23], Goldberg et al. state that selection procedures can be broadly classified into two classes: fitness-proportionate selection, which includes methods such as roulette-wheel and stochastic universal selection, and ordinal selection, which includes tournament selection and truncation selection. For the purposes of this project, a less entangled generalisation is adopted, to include [13]:
1. Pairing from top to bottom - Chromosomes are paired two at a time starting from the top until the top N_keep chromosomes are selected for mating. Thus the algorithm pairs odd rows with even rows, an approach not known for its exact likeness to nature but very simple to code.
2. Random pairing - This approach uses a uniform random number generator to select chromosomes for mating. From a practical point of view, the technique could be implemented by writing:

m_i = ceil(N_keep * rand(1, N_keep))
p_i = ceil(N_keep * rand(1, N_keep))

where ceil rounds the value up to the next highest integer.
3. Weighted random pairing - Probabilities that are inversely proportional to cost are assigned to the chromosomes in the mating pool. This is another way of saying that the chromosome with the lowest cost has the highest probability of mating, while the chromosome with the highest cost has the lowest probability of mating. There are two techniques to point out here: rank weighting and cost weighting.

(a) Rank weighting - A problem-independent approach which finds the probability from the rank n of the chromosome:

P_n = \frac{N_{keep} - n + 1}{\sum_{n=1}^{N_{keep}} n}

The cumulative probabilities of the chromosomes can be compared against a random number to select chromosomes from the mating pool (a short sketch of this scheme follows this list). If a chromosome were to become paired with itself, an alternative would be to randomly pick another chromosome. The randomness in this approach is more indicative of nature, and the probabilities only have to be calculated once, as they do not change from generation to generation.
(b) Cost weighting - With this approach, the probability of being selected is calculated from the cost of the chromosome instead of its rank in the population. A normalised cost is calculated for each chromosome by subtracting the lowest cost of the discarded chromosomes, c_{N_{keep}+1}, from the cost of each chromosome in the mating pool:

C_n = c_n - c_{N_{keep}+1}

Subtracting c_{N_{keep}+1} ensures all costs are negative. The probability P_n is then calculated from

P_n = \left| \frac{C_n}{\sum_{m=1}^{N_{keep}} C_m} \right|

This approach tends to weight the top chromosome more when there is a large variation in cost between the top and bottom chromosomes. However, when all chromosomes have the same cost, it tends to weight them evenly.
4. Tournament selection - This approach closely mimics mating competition in nature. It works by randomly selecting a small subset of s chromosomes from the mating pool (either with or without replacement), which are then entered into a tournament against each other. The fittest individual in the group of s chromosomes wins the tournament and is selected as a parent; put another way, the chromosome with the lowest cost in the subset becomes a parent. The most widely used value for the subset size s is 2. With this selection scheme, n tournaments are required to choose n individuals as parents. Tournament selection works best for larger population sizes, because sorting becomes time-consuming for large populations.
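The rank-weighting probabilities in 3(a) translate into a few lines of code. This sketch assumes the mating pool is already sorted best-first and that N_keep equals the pool size; both are simplifying assumptions for illustration.

    import random

    def rank_weights(n_keep):
        # P_n = (N_keep - n + 1) / sum_{n=1}^{N_keep} n, for ranks n = 1..N_keep
        total = n_keep * (n_keep + 1) // 2
        return [(n_keep - n + 1) / total for n in range(1, n_keep + 1)]

    def pick_parent(sorted_pool, probs):
        # Compare one uniform draw against the cumulative probabilities
        r, cum = random.random(), 0.0
        for chrom, p in zip(sorted_pool, probs):
            cum += p
            if r <= cum:
                return chrom
        return sorted_pool[-1]

    pool = ["c1", "c2", "c3", "c4"]        # assumed mating pool, lowest cost first
    probs = rank_weights(len(pool))        # computed once; ranks never change
    print(pick_parent(pool, probs))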
Recombination (Crossover)
After selection, individuals from the mating pool are recombined (or crossed over) to create new and hopefully better offspring. Many recombination operators described in the literature are problem-specific; however, some of the more generic ones are described here. For most recombination operators, two individuals are randomly selected and recombined with a probability p_c, called the crossover probability. This means a uniform random number r is generated, and the two individuals undergo recombination if r \le p_c; otherwise, if r > p_c, the two offspring are simply copies of their parents. The value of p_c can be set experimentally or can be set based on schema-theorem principles. Some recombination techniques include [23]:
1. k-point crossover - The simplest and most widely used types of crossover are the one-point and two-point crossover methods. Depending on the method, one or two crossover sites are randomly selected over the string length, and the alleles between the sites are exchanged between the two randomly paired individuals. The concept of one-point crossover can be extended to k-point crossover, where k crossover points are used rather than just one or two.

2. Uniform crossover - With this approach, every allele is exchanged between a pair of randomly selected chromosomes with a certain probability, p_e, known as the swapping probability, usually set at a value of 0.5.
3. Uniform order-based crossover - A recombination method specifically developed for search problems with permutation codes (a sketch follows this list). In uniform order-based crossover, two parents P_1 and P_2 are selected randomly and a random binary template is generated. To create an offspring C_1, some of its genes are filled by taking the genes from parent P_1 wherever there is a one in the template. The remaining gaps in offspring C_1 are filled by taking the genes of parent P_1 at the positions corresponding to zeros in the template and sorting them into the same order in which they appear in parent P_2; this sorted list is used to fill the gaps in offspring C_1. Offspring C_2 is created in the same manner.
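The following sketch implements the uniform order-based operator of item 3 for one offspring; the two parent permutations are made-up examples.

    import random

    def uniform_order_based(p1, p2):
        # One random binary template bit per gene position
        template = [random.randint(0, 1) for _ in p1]
        # Copy p1's genes straight into the child wherever the template has a one
        child = [g if t == 1 else None for g, t in zip(p1, template)]
        # The genes of p1 at the zero positions, reordered as they appear in p2
        missing = {g for g, t in zip(p1, template) if t == 0}
        ordered = iter(g for g in p2 if g in missing)
        # The sorted list fills the remaining gaps, left to right
        return [g if g is not None else next(ordered) for g in child]

    p1, p2 = [1, 2, 3, 4, 5], [5, 3, 1, 2, 4]    # assumed parent permutations
    print(uniform_order_based(p1, p2))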
There are several other permutation-based operators not discussed here; however, it is important to note that for hard problems, most of these operators are described in [23] as not scalable, though they may be useful as a first option when scaling genetic algorithms to larger and more complex dimensional problems. Recently, researchers have achieved significant success in designing scalable recombination operators that adapt linkage; nevertheless, the focus of this project is limited to an overview of a technique known as parallelisation.
0.2.3 Parallelisation
Evolutionary algorithms have been applied successfully to many numerical and combinatorial optimisation problems in recent years. EAs, including genetic algorithms, are generally able to find good solutions in reasonable amounts of time, but as they are applied to harder and bigger problems, there is an increase in the time required to find adequate solutions. One of the earliest practical approaches to scaling evolutionary algorithms to large-scale optimisation problems involved parallelising a genetic algorithm. The basic idea behind most parallel programs is to divide a huge task into chunks and to solve the chunks simultaneously using multiple processors. This divide-and-conquer strategy can be applied to GAs in a number of different ways, and there are numerous examples in the literature of successful parallelised implementations. Some parallelisation methods use a single population, while others divide the population into several relatively isolated subpopulations.
Some methods described in the literature have exploited massively parallel computer architectures; others prove to be efficient and better suited to multicomputers with fewer but more powerful processing elements. Parallel GAs can be classified into three main types: (1) global single-population master-slave GAs, (2) single-population fine-grained GAs, and (3) multiple-population coarse-grained GAs. A summary of these techniques is given in [29]:
1. Master-slave parallelisation - Also known as global parallel GA, because selection and crossover consider the entire population. It is often referred to as distributed fitness evaluation and was one of the first successful applications of parallel GAs. This approach utilises a single population, and the evaluation of individuals and/or the application of genetic operators are performed in parallel. In the master-slave model, selection and mating are performed globally, hence each individual may compete and mate with any other. Because it normally requires only knowledge of the individual being evaluated, the evaluation of the fitness function is the operation most commonly parallelised. It is usually implemented using master-slave programs, where the master stores the population and the slaves evaluate the fitness, apply mutation, and sometimes exchange bits of genome, i.e. perform crossover. Parallelisation of fitness evaluation occurs by assigning a fraction of the population to each of the available processors. The algorithm is said to be synchronous if the master stops and waits to receive the fitness values for the whole population before proceeding to the next generation; the opposite is the case for the asynchronous version, which does not stop to wait for any slow processors.
The popularity of parallel GAs was down to a number of reasons. For one thing, there was a similarity between a parallel GA and the simple (serial) GA, as the former was simply an extension of the latter: it was, in effect, as simple as taking a few conventional serial GAs, running them on each node of a parallel computer, and exchanging a few individuals at predetermined intervals. They were also easy to simulate on a network of workstations, or even on a single processor using free software. The relatively little extra effort needed to convert a serial GA into a multiple-deme GA ensured that the popularity of parallel GAs continued to grow in the late 1980s and early 1990s, when researchers began to explore alternatives to make them faster and to understand better how they worked. These efforts led to the emergence of several important issues. For instance, it was obvious that parallel GAs were very promising in terms of gains in performance, but they were also more complex. Specifically, the migration of individuals from one deme (subpopulation) to another is controlled by several parameters: (a) the topology that defines the connections between subpopulations, (b) a migration rate that controls how many individuals migrate, and (c) a migration interval that affects the frequency of migration.
2. Coarse-grained algorithms - A general term for a model of subpopulations that together constitute the entire population, also described in the literature as a relatively small number of populations with many individuals. They are sometimes known as distributed GAs because they are typically implemented on distributed-memory computers. Basically, a GA is run on each subpopulation, and an exchange of information between demes (migration) occurs occasionally. Here a critical issue is how migration among demes should be implemented: for instance, the event that triggers migration, the number of individuals migrated as a result, and the communication topology are all important factors to be considered in this strategy [31]. These models are characterised by the relatively long time they require for processing a generation within each deme, and by their occasional communication for exchanging individuals.
3. Fine-grained algorithms - These are said to require a large number of processors, because the population is divided into a large number of small demes. They are also described as having only one population, with a spatial structure limiting interaction between individuals. In essence, exactly one individual is assigned to each processor, and genetic operations take place in parallel among adjacent processors, allowing the individual in each processor to be replaced by a new offspring after each generation. The topology of the network strongly determines the behaviour of this GA. The local nature of the genetic operations allows for natural diversity in many applications. From an implementation point of view, this model is viewed as a simple extension of a serial GA implemented on massively parallel machines.
One of the earlier, more visible attempts to scale fine-grained parallel GAs was the work undertaken by H. Muhlenbein [26]. Muhlenbein asserts the rationale for his work by alluding to the systems theory of evolution. He describes his fine-grained asynchronous parallel genetic algorithm (ASPARAGOS), which used a population structure that resembled a ladder with the upper and lower ends tied together [29], as adding two extensions to classical genetic algorithms: performing selection locally in a neighborhood and, secondly, each individual doing its own local hill climbing. These two extensions, according to Muhlenbein, help to overcome the problem of complex epistatic interactions common in combinatorial optimisation with genetic algorithms. One may instantly recall the issue of migration discussed earlier and how communication topologies affect the behaviour of a GA. Muhlenbein applies his ASPARAGOS algorithm to a general combinatorial optimisation problem: the quadratic assignment problem (QAP).
The QAP is a fundamental combinatorial optimisation problem in the category of facility location problems. The problem arises, for example, in VLSI module placement, the design of factories, and process-to-processor mapping in parallel processing. The QAP can be formulated as follows [26]:

Definition - Let two n \times n matrices A, B be given, and let x_{ij} denote:

x_{ij} = 1 if process i is on processor j, and 0 otherwise.

Then the QAP can be defined as:

\min \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} a_{ik} \, b_{jl} \, x_{ij} \, x_{kl}

subject to \sum_{i} x_{ij} = 1 and \sum_{j} x_{ij} = 1.
The QAP is an NP-complete problem; an instance of size n has n! different placements. The problem is represented genetically by arranging the processors on a chromosome, where the genes specify which process is placed on which processor. Each process is placed only once in a valid phenotype, a constraint which produces a nonlinear epistatic interaction by virtue of the fact that only a fraction of all n^n possible genotypes gives a valid phenotype. Local hill climbing of individuals is done with a technique known as the simple 2-opt exchange; the exchange is accepted if it decreases the function to be optimised. Along with other arrangements, a recombination operator called p-sexual voting recombination is defined, the rationale for which lies in the strong similarity found between solutions.
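To make the QAP objective and the 2-opt exchange concrete, the sketch below evaluates a placement under two tiny made-up matrices; a permutation stands in for the chromosome described above.

    import random

    def qap_cost(A, B, perm):
        # perm[i] = processor hosting process i; cost per the QAP double sum
        n = len(perm)
        return sum(A[i][k] * B[perm[i]][perm[k]] for i in range(n) for k in range(n))

    A = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]    # assumed inter-process flow matrix
    B = [[0, 2, 4], [2, 0, 1], [4, 1, 0]]    # assumed inter-processor distance matrix
    perm = [0, 1, 2]

    # Simple 2-opt exchange: swap two positions, keep the move only if cost decreases
    i, j = random.sample(range(len(perm)), 2)
    trial = perm[:]
    trial[i], trial[j] = trial[j], trial[i]
    if qap_cost(A, B, trial) < qap_cost(A, B, perm):
        perm = trial
    print(perm, qap_cost(A, B, perm))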
Implementing the neighborhood was done on a 64-processor system with distributed memory that allowed the interconnection of the processors to be configured to fit the application. A neighborhood was implemented where each individual has four neighbors, including the global best with a weight of 2, and a population size equal to the number of processors. The algorithm was tested on two versions of the travelling salesman problem, Steinberg's problem, and at least one other quadratic assignment problem. Muhlenbein noted that the time per generation increases with the problem size, so a hill-climbing algorithm that grows only linearly with the problem was used to handle larger problems. The algorithm was run for 10 runs using 16, 32 and 64 processors, with 64 processors observed to give the best results by far. Significantly, Muhlenbein's asynchronous parallel genetic algorithm (ASPARAGOS) found a new optimum for Steinberg's problem, which was one of the largest QAPs to have been published at the time.
In [1], Goldberg et al. present an innovative approach to scaling genetic algorithms which makes for compelling reading. MapReduce is a programming model developed by Google that enables users to easily develop large-scale distributed applications; it is analogous to a coarse-grained parallel GA implementation. It was inspired by the map and reduce primitives present in functional languages. According to Goldberg et al., the associated implementation parallelises large computations by executing each map function invocation independently of every other invocation, using re-execution as the primary mechanism for fault tolerance. The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions, Map and Reduce. Both Map and Reduce are functions written by the user, and data are internally represented as sets of key/value pairs. The MapReduce framework takes each input pair and produces a set of intermediate key/value pairs, then groups together all intermediate values associated with the same intermediate key I and passes them on to the Reduce function. The Reduce function accepts the intermediate key I and a set of values for that key, and merges these values to form a possibly smaller set of values. The intermediate values are supplied to the Reduce function via an iterator, which enables the model to handle lists of values too large to fit in main memory. It is possible to conceptualise the map and reduce functions supplied by the user as having the following types:
map(k_1, v_1) \to list(k_2, v_2)
reduce(k_2, list(v_2)) \to list(v_3)

In other words, the input keys and values are drawn from a different domain than the output keys and values, while the intermediate keys and values are from the same domain as the output keys and values.
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits, while the Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function, according to the default configuration of Hadoop, an open-source implementation of the MapReduce model. Goldberg et al. transform and implement a simple model of genetic algorithms using MapReduce, encapsulating each iteration of the GA as a separate MapReduce job.

The MAP function evaluates the fitness of a given individual, keeps track of the best individual, and writes it to a global file in the distributed file system. The client reads these values from all mappers at the end of each MapReduce job and checks whether the convergence criterion has been satisfied. The default partitioner is overridden by one which shuffles individuals across different reducers. In the REDUCE function, tournament selection without replacement is implemented, where the winner is randomly selected among S individuals in a tournament; the process is repeated a number of times equal to the population size. For this implementation, S is set to 5, and crossover is performed using two consecutively selected parents.

The implementation of the simple genetic algorithm is also split into phases. In order to reduce the time taken for serial initialisation of the population, the initial population is created in a separate MapReduce phase, in which the MAP generates random individuals and the REDUCE is the identity reducer. The pseudo-random number generator is seeded with the current time, and the bits of the variables in an individual are compactly represented in an array of long long ints. Due to the inability to express loops in the MapReduce model, iterations, each consisting of a Map and a Reduce, are executed until the convergence criterion is satisfied.
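The shape of one such GA iteration can be mimicked in ordinary Python, as in the conceptual sketch below. This is not the authors' Hadoop code: the bit-counting fitness, the number of reducers, and the population size are assumptions made purely to show the map/shuffle/reduce structure.

    import random

    R = 4                                        # assumed number of reducers

    def map_phase(individual):
        # MAP: evaluate fitness and emit (key, value); a random key shuffles
        # individuals across reducers, as with the overridden partitioner
        return (random.randrange(R), (individual, sum(individual)))

    def reduce_phase(pairs, out_size, s=5):
        # REDUCE: tournament selection without replacement among s individuals
        selected = []
        for _ in range(out_size):
            entrants = random.sample(pairs, min(s, len(pairs)))
            winner = max(entrants, key=lambda kv: kv[1][1])
            selected.append(winner[1][0])
        return selected

    population = [[random.randint(0, 1) for _ in range(16)] for _ in range(40)]
    emitted = [map_phase(ind) for ind in population]          # one MapReduce job
    shards = {k: [kv for kv in emitted if kv[0] == k] for k in range(R)}
    next_pop = [ind for shard in shards.values() if shard
                for ind in reduce_phase(shard, len(shard))]
    print(len(next_pop))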
Experiments were carried out using the ONEMAX problem (bit counting), a simple problem consisting in maximising the number of ones in a bitstring. The problem can be formally defined as finding a string x = (x_1, x_2, ..., x_N), with x_i \in \{0, 1\}, that maximises the following equation:

F(x) = \sum_{i=1}^{N} x_i
The problem was implemented in Hadoop and run on 52 nodes, each node capable of running 5 mappers and 3 reducers in parallel. Multiple experiments were conducted and different performance measures taken. For each experiment, the population for the GA was set to n log n, with n being the number of variables. Of particular interest is the set of experiments where scalability was observed by increasing the problem size. This was done by increasing the number of variables, with the implementation scaling to n = 10^5 variables while keeping the population set to n log n. The time per iteration was observed to increase sharply as the number of variables approached n = 10^5, since the population grows super-linearly (n log n), amounting to more than 16 million individuals.
Other performance measurements observed scalability under a constant overall load, where the problem size was fixed at 50,000 variables and the number of mappers was increased. The time per iteration was observed to decrease as more and more mappers were added; thus, adding more resources while keeping the problem size fixed decreases the time per iteration. Finally, scalability was observed with a constant load per node, where the load was set to 1,000 variables per mapper. In this test, the time per iteration was observed to increase initially and then stabilise at around 75 seconds.
Even though GAs have been shown to have tremendous potential for parallelisation, it is, up to the present time, still common practice to implement GAs in a serial fashion for problems of medium complexity. Numerous challenges have been identified in the literature which limit an efficient implementation of GAs over modern parallel computing paradigms. The majority of these challenges have to be overcome outside the GA community; for instance, the high-performance computing community needs to, among other things, provide a way to deal with heterogeneous resources, as well as provide easy and reliable methods to access remote archival and real-time data sources [2].
0.2.4 Evolutionary Programming
Evolutionary programming (EP), originally conceived by Lawrence J. Fogel in 1960, is a stochastic optimisation strategy of a similar genre to genetic algorithms. It was originally developed to simulate evolution as a learning process with the aim of generating artificial intelligence. At the time, intelligence was perceived to be the capability of a system to adapt its behaviour in order to meet some specified goals in a range of environments. The capability to predict the environment was considered a prerequisite for adaptivity and, consequently, intelligent behaviour [10].
In a classical example of EP, predictors were evolved in the form of finite state machines (FSM). A finite state machine in its most basic form is simply a behavioural model used to design programs, composed of a finite number of states associated with transitions. In the context of this discussion, one may consider an FSM as implemented by Fogel to be a transducer that can be stimulated by a finite alphabet of input symbols and can respond in a finite alphabet of output symbols. A finite state machine consists of a number of states S and a number of state transitions which define the inner workings of the FSM. The FSM could be trained to learn simple prediction tasks. An example could be guessing the next symbol in an input stream: considering n inputs, predict the (n+1)th input, and articulate this prediction by the nth output symbol, in essence requiring the input alphabet and the output alphabet to be the same. In this case, the performance of an FSM is measured by the percentage of inputs where input_{n+1} = output_n.
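A toy version of such a predictor is easy to express. The two-state machine and its transition table below are made-up assumptions for illustration; Fogel's machines were evolved, not hand-written.

    # Transition table: (state, input) -> (next_state, output);
    # output n is read as the prediction of input n+1
    fsm = {
        ("A", 0): ("A", 0), ("A", 1): ("B", 1),    # assumed transitions
        ("B", 0): ("B", 1), ("B", 1): ("A", 0),
    }

    def accuracy(fsm, sequence, start="A"):
        state, hits = start, 0
        for sym, nxt in zip(sequence, sequence[1:]):
            state, out = fsm[(state, sym)]
            hits += (out == nxt)                   # did output_n equal input_{n+1}?
        return hits / (len(sequence) - 1)

    print(accuracy(fsm, [0, 1, 1, 0, 1, 0, 0, 1]))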
In Fogel's experiments, predictors were evolved to tell whether the next input (an integer) in a sequence is prime or not. The fitness of an FSM was defined in terms of its prediction accuracy on the input sequence of consecutive integers. No parent selection took place; however, each FSM in a given population was mutated once to generate a new offspring. Five generally usable mutation operators were used to generate new FSMs: (a) changing an output symbol, (b) changing a state transition, (c) adding a state, (d) deleting a state, and (e) changing the initial state. A choice from these mutation operators is made randomly using a uniform distribution. There is no recombination (or crossover), and the model, after creating offspring from a population of FSMs, saves the top 50% of their union as the next generation. The results obtained with this system showed that after approximately 200 symbols, the best FSM was an opportunistic one: its strategy (in essence, always predicting "not prime") is good enough for accuracies above 81%, and it provides empirical proof that a simulated evolutionary process is able to create good solutions for an intelligent task.
For historical reasons, EP has long been associated with prediction tasks and the use of finite state machines as their representation. However, since the 1990s, EP variants for the optimisation of real-valued parameter vectors have become standard. EP is considered a very open framework in terms of representation and mutation operators. It is suggested in the literature that the representation should not be fixed in advance or in general, but derived from the problem to be solved. Conventional optimisation by EP can be summarised in two major steps [24]:

1. Mutate the solutions in the current population.

2. Select the next generation from the mutated and current solutions.

These two steps are regarded as a population-based version of the classical generate-and-test method, where mutation is used to generate new solutions (offspring) and selection is used to test which of the newly generated solutions should survive to the next generation. A review of these important concepts is considered in subsequent sections, after a brief look at fitness and representation.
0.2.5 Operators in Evolutionary Programming
Fitness Evaluation and Representation
EP is most frequently used for optimising functions of the form f : \mathbb{R}^n \to \mathbb{R} and uses a straightforward floating-point representation, where vectors \langle x_1, ..., x_n \rangle \in \mathbb{R}^n represent individuals. In (Fogel 1992b), a concept of meta-evolutionary programming (meta-EP) is presented, which turned out to be similar to the self-adaptation paradigm of standard deviations in evolution strategies (ES). It amounts to adding strategy parameters to individuals in a manner not dissimilar from ES. Fogel's version required incorporating a vector v \in \mathbb{R}_+^n of variances v_i = \sigma_i^2, rather than standard deviations, as strategy parameters, extending the space of individuals to I = \mathbb{R}^n \times \mathbb{R}_+^n. However, this was said to lead to problems where the scheme frequently generated negative, and therefore invalid, values for the offspring variance. For this reason, a boundary condition rule of the form v_i \le 0 \Rightarrow v_i := \epsilon_0, for a small \epsilon_0 > 0, is usually called into play. The general form of individuals in evolutionary programming is given as [10]:

a = \langle \underbrace{x_1, ..., x_n}_{x}, \underbrace{\sigma_1, ..., \sigma_n}_{\sigma} \rangle

The fitness values \Phi(a) are obtained from objective function values f(x) by scaling them to positive values and possibly imposing some random alteration \theta: \Phi(a) = \delta(f(x), \theta), where \delta denotes the scaling function [3].
Mutation
As far as EP is concerned, there is no single mutation operator. The choice is most often determined by the representation, as is often the case in GAs. For the sake of coherence, the discussion on mutation is limited to the progression from standard EP to meta-EP [3][10].

Mutation transforms a chromosome \langle x_1, ..., x_n, \sigma_1, ..., \sigma_n \rangle into \langle x'_1, ..., x'_n, \sigma'_1, ..., \sigma'_n \rangle, where:

\sigma'_i = \sigma_i \, (1 + \alpha \, N(0, 1))   (1)

x'_i = x_i + \sigma'_i \, N_i(0, 1)   (2)

Here N(0, 1) denotes a realisation of a Gaussian random variable with zero mean and standard deviation 1, and \alpha \approx 0.2.
In standard EP, the Gaussian mutation operator m_{\beta_1,...,\beta_n;\gamma_1,...,\gamma_n} : I \to I, with m(x) = x', uses a standard deviation which is obtained for each component x_i as the square root of a linear transformation of the fitness value \Phi(x), for i = 1, ..., n. Equation 1 is rewritten to satisfy this premise:

\sigma_i = \sqrt{\beta_i \, \Phi(x) + \gamma_i}

Here \beta_i and \gamma_i are the proportionality constants and the offsets, respectively, which must be tuned for a particular task. To overcome these tuning difficulties, meta-EP self-adapts n variances per individual in a manner similar to ES, where the mutation m_\zeta : I \to I, m_\zeta(a) = (x', v'), works in the following manner (for i = 1, ..., n):

x'_i = x_i + \sqrt{v_i} \, N_i(0, 1)

v'_i = v_i + \sqrt{\zeta v_i} \, N_i(0, 1)

Here \zeta denotes an exogenous parameter ensuring that v_i tends to remain positive. Whenever, by means of mutation, a variance becomes negative or zero, it is set to a small value \epsilon > 0. Many other mutation schemes have been proposed, including one in which the step size is inversely related to the fitness of the solutions. Other ideas from ES have also informed the development of EP algorithms, including a version with covariance matrices; R-meta-EP is also in use.
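The meta-EP update rules translate almost line for line into code. In the sketch below, the values of \zeta and \epsilon are arbitrary illustrative choices, not values prescribed by the text.

    import random

    ZETA, EPS = 6.0, 1e-4        # assumed exogenous parameter and variance floor

    def meta_ep_mutate(x, v):
        # x'_i = x_i + sqrt(v_i) N_i(0,1);  v'_i = v_i + sqrt(zeta v_i) N_i(0,1)
        x_new = [xi + (vi ** 0.5) * random.gauss(0, 1) for xi, vi in zip(x, v)]
        v_new = [vi + ((ZETA * vi) ** 0.5) * random.gauss(0, 1) for vi in v]
        # Repair rule: a variance driven negative or to zero is reset to epsilon
        return x_new, [max(vi, EPS) for vi in v_new]

    x, v = [0.5, -1.2, 3.0], [1.0, 1.0, 1.0]     # assumed individual and variances
    print(meta_ep_mutate(x, v))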
Recombination
A recombination operator combining features of different individuals occurring in a population, or anything similar to the recombination operator in genetic algorithms, is not used in evolutionary programming. In EP, each complete solution is generally regarded as a species producing a population of its own, such that the mutation operator simulates all the changes that transpire between one generation and the next. Most of the arguments against recombination should thus be viewed as conceptual rather than being the result of any technical constraints.
Selection
In EP, every member of the population produces exactly one offspring via mutation. In this way it differs from genetic algorithms, where selective pressure based on fitness is applied at that stage in the process. After creating \mu offspring from \mu parent individuals by mutating each parent once, a variant of stochastic q-tournament selection selects \mu individuals from the union of parents and offspring in a round-robin format; this is to say a randomised (\mu + \mu) selection is used. Fundamentally, for each individual a_k \in P(t) \cup P'(t), where P'(t) is the population of mutated individuals, q individuals are chosen at random from P(t) \cup P'(t) and compared to a_k with respect to their fitness values. For each comparison, a win is assigned if a_k is better than its opponent. Counting how many of the q individuals are worse than a_k then results in a score w_k \in \{0, ..., q\}; typically q = 10 is recommended. After doing so for all 2\mu individuals, the individuals are ranked in descending order of the score values w_i (i = 1, ..., 2\mu), and the \mu individuals with the highest scores w_i are selected to form the next population. More formally, for i = 1, ..., 2\mu, the score w_i is given as:

w_i = \sum_{j=1}^{q} \begin{cases} 1 & \text{if } \Phi(a_i) \le \Phi(a_{\chi_j}) \\ 0 & \text{otherwise} \end{cases}

where \chi_j \in \{1, ..., 2\mu\} is a uniform random variable, sampled anew for each comparison. As the tournament size q increases, the mechanism becomes more and more a deterministic (\mu + \mu) scheme.
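A compact sketch of this round-robin scoring is given below, assuming minimisation of an arbitrary fitness \Phi and the recommended q = 10; the fitness function and population here are made-up examples.

    import random

    Q = 10                                       # recommended tournament size

    def ep_select(parents, offspring, phi):
        # Randomised (mu + mu) selection over the union of parents and offspring
        union = parents + offspring

        def score(a):
            # w = number of q random opponents that a is no worse than
            opponents = [random.choice(union) for _ in range(Q)]
            return sum(1 for b in opponents if phi(a) <= phi(b))

        ranked = sorted(union, key=score, reverse=True)
        return ranked[:len(parents)]             # the mu highest scores survive

    phi = lambda x: sum(xi ** 2 for xi in x)     # assumed fitness, to be minimised
    parents = [[random.uniform(-3, 3) for _ in range(2)] for _ in range(10)]
    offspring = [[xi + random.gauss(0, 1) for xi in p] for p in parents]
    print(ep_select(parents, offspring, phi)[0])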
0.2.6 Scaling Evolutionary Programming
One of the more prominent attempts to scale EP to problems of higher dimension, i.e. 100-1000 decision variables, involved using a method called fast evolutionary programming (FEP) with cooperative coevolution (CC) [39].
In [24], Yao and Liu were motivated by the slow convergence rates of classical EP on functions of a multimodal nature, and came up with what they called fast EP as a viable alternative. Fast EP is exactly the same as classical EP except that, instead of a mutation operator based on a Gaussian distribution, it uses a mutation based on Cauchy random numbers. To see this, they replace equation 2 as follows:

x'_i(j) = x_i(j) + \eta_i(j) \, \delta_j   (3)

where \delta_j is a Cauchy random variable with scale parameter t = 1, generated anew for each value of j. From their studies, they observe that Cauchy mutation is more likely to generate offspring further away from its parent than Gaussian mutation, due to its long flat tails. According to Yao and Liu, the Cauchy mutation is expected to have a higher probability of escaping from a local optimum or moving away from a plateau, especially when the basin of attraction of the local minimum or the plateau is large relative to the mean step size. Yao and Liu conduct extensive empirical studies of both standard (classical) EP (CEP) and FEP to determine the relative strengths and weaknesses of both variants on different problems. They used twenty-three benchmark functions to aid their efforts to understand when and why FEP was better than CEP, and cite Wolpert and Macready's no-free-lunch theorem to justify their use of an unusually large number of functions. The chosen functions were of various peculiarities and varied in difficulty level.
Functions f_1 to f_13 were high dimensional problems; within these, functions f_1 to f_5 were unimodal, function f_6 was described as a step function with one minimum and a discontinuous surface, and function f_7 was said to be a noisy quartic function, where random[0,1) is a uniformly distributed random variable in [0,1). Functions f_8 to f_13 were multimodal functions where the number of local minima increases exponentially with problem dimension, while functions f_14 to f_23 were low dimensional functions with only a few local minima. The experimental set-up used identical parameters for both FEP and CEP: a population size of μ = 100, a tournament size of q = 10 for selection, the same standard deviation for Gaussian mutations (3.0) and the same initial population, generated uniformly at random in a specified range.
On unimodal functions, the average results of 50 independent runs were taken, showing the progress of the mean best solutions and the mean of the average population values found by CEP and FEP for f_1 to f_7. FEP was observed to outperform CEP in terms of convergence rates, although CEP displayed better final results on f_1 and f_2; FEP was noted to be weaker than CEP in fine tuning. On multimodal functions with dimensions all set to 30, FEP performed significantly better than CEP, as CEP appeared to become trapped in local minima and unable to escape due to a smaller probability of making long jumps. FEP appeared to converge at least at a linear rate with respect to the number of generations, and the authors observed an exponential convergence rate on some of the problems. Generally, the results showed that the Cauchy mutation is an efficient search operator for a large class of multimodal function optimisation problems.
Cooperative coevolution is an approach to modeling the coevolution of cooperating species. It is a way to represent and solve larger and more complex problems by introducing explicit notions of modularity, in order to provide reasonable opportunities for complex solutions to evolve in the form of interacting co-adapted subcomponents. Potter and De Jong [30] combine and extend contemporary knowledge in this area by applying their ideas to genetic algorithms in the following ways:
1. A species or subpopulation should represent a subcomponent of a potential solution.
2. Complete solutions should be obtained by assembling representative members (individuals) of each of the species present in the solution.
3. Credit assignment should be defined in terms of the fitness of the complete solutions in which members of subpopulations participate.
4. The number of subpopulations should evolve when required.
5. The evolution of each subpopulation is handled by a standard evolutionary algorithm.
There are three steps in cooperative coevolutionary algorithms. The first step is to decompose a large system into many sub-systems or modules, which are smaller and easier to design and manage. The second step is to evolve the modules in separate populations. The third step is to reconstruct the whole system from the modules. The second and third steps can be repeated for many generations. Tests were conducted to compare the performance of the cooperative coevolution GA (CCGA-1) with that of a standard GA on some highly multimodal test functions, including Rastrigin, Schwefel, Griewank and Ackley. The only difference between the two algorithms was the utilisation of multiple species. Both algorithms were terminated after 100,000 function evaluations. In sum, in all cases CCGA-1 was found to significantly outperform the standard GA, both in the minimum value found and in the speed of convergence to zero.
Yao and his colleagues adopt cooperative coevolution techniques in [39] to scale Fast Evolutionary Programming to high dimensional problems. In their implementation, Yao et al use an approach different from that of FEP to estimate the fitness of an individual in FEP with cooperative coevolution (FEPCC). As stated earlier, in cooperative coevolution each individual may be a component of a potential solution. In FEPCC, because each individual is a component in a vector and the other components remain unchanged, evaluation is limited to calculating the difference caused by the changed component. The fitness of an individual in FEPCC is estimated by combining it with the current best individuals from each of the other populations to form a vector of real values, and applying that vector to the target function.
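A sketch of this evaluation scheme, assuming a minimisation objective over a real vector; the function and variable names are illustrative:

```python
import numpy as np

def fepcc_fitness(component, index, context_best, objective):
    """Estimate the fitness of one coevolving component: plug it into a context
    vector assembled from the current best member of every other subpopulation."""
    context = np.array(context_best, dtype=float)
    context[index] = component
    return objective(context)

# illustrative usage on a 10-dimensional sphere function
sphere = lambda x: float(np.sum(x ** 2))
print(fepcc_fitness(0.5, 3, np.zeros(10), sphere))   # 0.25
```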
Eight benchmark functions were used in experiments, where f_1 to f_4 were unimodal functions and f_5 to f_8 were multimodal functions whose number of local minima increases exponentially with the problem dimension. The experimental set-up consisted of a population size of μ = 50 and a tournament size of q = 10, with the initial population generated uniformly at random in a specified range. The first set of experiments, on the unimodal functions f_1 to f_4, had the dimension set at 100, 250, 500, 750 and 1000. The average results of fifty runs were taken at each dimension. The results showed that the computational time used by FEPCC to find an optimal solution grew in the order of O(n). For instance, for function f_1, 500,000 fitness evaluations were needed for n = 100, 1,250,000 for n = 250, 2,500,000 for n = 500, 3,750,000 for n = 750 and 5,000,000 for n = 1000. The results for the multimodal functions were not dissimilar. These are very interesting results, as they would seem to suggest that FEPCC scales linearly as the dimension size increases.
0.3 The Particle Swarm
Particle swarm optimisation (PSO) is a population based metaheuristic optimisation technique. Particle swarm optimisation is similar to the genetic algorithm in that the system is initialised with a population of random solutions. However, it is unlike the GA (and EP for that matter) in that each potential solution is also assigned a randomised velocity; operations like recombination and mutation play no part in the particle swarm algorithm. It was originally designed and introduced by Kennedy and Eberhart [9][18] in 1995. The PSO approach utilises a cooperative swarm of particles, where each particle represents a candidate solution, to explore the space of possible solutions to an optimisation problem. It is based on the simulation of the social behaviour of a flock of birds or a school of fish, and the algorithm was originally intended to graphically simulate the graceful and unpredictable choreography of a bird flock. Each particle is randomly or heuristically initialised and then allowed to fly around the search space. At each iteration, each particle is able to evaluate its own fitness and the fitness of its neighboring particles, such that each particle keeps track of the solution which resulted in its own best fitness, as well as the candidate solution of the best performing particle in its neighborhood. As a metaheuristic, PSO makes few or no assumptions about the problem being optimised and can search very large spaces of candidate solutions.
The PSO was originally developed for continuous valued spaces, but many problems are defined for discrete valued spaces where the domain of the variables is finite. It has been shown elsewhere that this simple model can deal with difficult optimisation problems efficiently. This chapter attempts to elucidate several key themes, beginning in the next section with the philosophical underpinnings of PSO, followed by an in-depth look at some of the technical and theoretical aspects. Finally, a discussion on scalability is undertaken.
0.3.1 Sociocognitive Foundations
In chapter six of their book on Swarm Intelligence, Kennedy and Eberhart dwell on a theme in which they characterize thinking as a social activity. They hypothesise that human culture and cognition are aspects of a single process and that people learn from one another not only facts but methods for processing those facts. Further, not only do people learn from each other but, as knowledge and skills spread, a population converges on optimal processes. One inevitably acknowledges the link between culture and cognition in the truth of this statement and, as a consequence, the soundness of this line of reasoning. They describe this adaptation in a cultural context as a phenomenon that operates simultaneously on three levels:
Individuals learning locally from their neighbours, or local social learning, is an easily measured and well documented phenomenon. People are quite aware of this culture of learning, where individuals glean insights from their neighbours and share their own insights in turn.
This spread of knowledge through social learning affects the sociological, economic and political strata of society and results in emergent group-level processes, which can be seen as regularities in beliefs, attitudes and behaviours across individuals within a population.
Through local interaction, the spread of culture carries insights and innovations from one individual to the next, from originator to distant individuals; the role of culture in the optimisation of cognition cannot be overemphasised, with a continuous pattern of further combination of various innovations resulting in improved methods that are largely transparent to the actors who benefit from them.
Such views were by no means novel at the time; scientists had long theorized that culture improves human adaptability. Boyd and Richerson had previously argued that social learning allows individuals to avoid the costs of individual learning. The relevance of this theory to particle swarm lies in the belief that individuals are influenced by their own experiences and the successes of their neighbors, that is to say their neighbors' experiences. Kennedy and Eberhart reference earlier work by the political scientist Robert Axelrod, who proposed a computational model for the dissemination of culture. They theorised that the process of cultural adaptation comprises a high-level component, observable in the formation of patterns across individuals and the ability to solve problems, and a low-level component, the actual and universal behaviours of individuals. A sociocognitive theory which summarizes the process of cultural adaptation in terms of three principles was adopted:
Evaluate - An organism is incapable of learning unless it possesses the ability to evaluate its environment: it must distinguish attractive and repulsive features in its environment and generally tell good from bad. On this view, learning could be defined as a change that enables an organism to improve the average evaluation of its environment. The propensity to evaluate stimuli is perhaps the most ubiquitous behavioural characteristic of living organisms.
Compare - Theory suggests people use others as a standard for measuring themselves, and such comparisons serve as a kind of motivation to learn and to change. The desire to compare, even while contrasting, is second nature for most people, who will compare anything from the superficial, like looks, cars or clothes, to more substantive things like IQ. Particles in swarms also have the ability to compare themselves with their neighbours: particles compare themselves with their topological neighbors on specific parametric measurements and imitate only those neighbors who are considered superior to themselves.
Imitate - Here the concern does not extend to the opposing sides (real or imagined) of philosophical debates as to the propriety or otherwise of true imitation outside of the human race. Kennedy and Eberhart are satisfied to state that these three principles of evaluating, comparing and imitating may be combined, even in simplified social beings in computer programs, enabling them to adapt to complex environmental challenges.
Taken together, this view provides a simplified basis for understanding the inner workings of the particle swarm algorithm. Particle swarm can then be easily understood by viewing each particle in the swarm (bird flock or fish school) as an individual in a population (society). Here the focus is on particle swarm over continuous numbers, that is to say the particle swarm algorithm searches an n-dimensional space ℝ^n of real numbers. One may conceptualize each particle as a point in the Cartesian plane; it then becomes a matter of a small philosophical leap to suggest that multiple individuals can be generated within a single set of coordinates to produce a population of points. As Kennedy et al put it, in this view change over time can be represented as movement of points, learning may be viewed as cognitive decrease and increase along dimensions, changes in attitude may be seen as movements between the positive and negative ends of an axis, and so on. In essence, as multiple individuals exist within the same high-dimensional framework, the coordinate system can be seen as containing a number of moving particles. With respect to the evaluation of particles, the space in which the particles move can be described as heterogeneous, with some regions being better than others, and there exists a preference for the better regions of this space that allows a vector of cognitive, mathematical or engineering parameters to be accurately evaluated.
The evolution of the particle swarm paradigm has its roots in the simulation of a simplified social model. In other words, that PSO was first intended as a stylized representation of the movement of organisms in a bird flock or fish school is a testament to the (not so) subtle link between social science simulations and computer programs for engineering. Indeed, it is worthy of note that many psychological, physical and biological theories have often influenced and even spawned computational methods for problem solving in recent history.
0.3.2 Continuous PSO
The initial canonical particle swarm proposed by Kennedy and Eberhart implemented what is generally referred to as the global or gbest model [18]. In this version, the best particle in the neighborhood is the best particle over the whole population, as opposed to the lbest version, where a particle's neighborhood is defined within a local topological neighborhood [9]. In real-valued particle swarm, each particle is a potential solution and can be conceptualized as a point in hyperspace.
The position of a particle i is assigned the algebraic vector symbol x_i; there can be any number of particles, and each vector can be of any dimension. The change of position of a particle is denoted v_i, for velocity, which is essentially a vector of numbers added to the position coordinates in order to move the particle from one time step to the next. Each individual is presumed to be moving at all times, and the direction of movement is a function of the particle's current position and velocity, the location of the particle's best success, and the best position found by any member of the neighborhood. This can be expressed theoretically as [17]:

x_i(t + 1) = f(x_i(t), v_i(t), p_i, p_g)
The particles are flown through the search space by updating the position of the i-th particle at each time step t in ℝ^n according to the following equations:

v_i(t + 1) = v_i(t) + c_1 φ_1 (p_i − x_i(t)) + c_2 φ_2 (p_g − x_i(t))    (4)

x_i(t + 1) = x_i(t) + v_i(t + 1)    (5)

where φ_1 and φ_2 are random coefficients defined by an upper limit, p_i is the i-th particle's best position, p_g is the neighborhood best and x_i(t) is the i-th particle's current position at time t. In [18], Kennedy and Eberhart suggest the integer 2 as a good value for c_1 and c_2, as it on average makes the weights of the social and cognitive parts equal to 1.
The equation represents an oversimplification of the socio-psychological view that individuals are more affected by those who are more successful, persuasive or otherwise prestigious in society. In their initial paper, Kennedy and Eberhart go on to try several variations of the algorithm but found this simplified version to be the most effective. Early tests applied the algorithm to training neural networks (NN) and also to the extremely nonlinear Schaffer f6 function. The particle swarm optimiser was found to train the network to achieve 92 percent correct on test data, as opposed to the 89 percent achieved by a backpropagation NN. With the Schaffer f6 function, the particle swarm optimiser was able to approximate results reported for elementary genetic algorithms. The effect of the random nature of the algorithm is that the particles cycle unevenly around a point defined as the weighted average of the two bests:

(φ_1 p_i + φ_2 p_g) / (φ_1 + φ_2)
The random numbers therefore cause a change in the exact location of the particle at every iteration. Because of a tendency to explode as oscillations become wider, a velocity damping method is usually required. One such technique is to define a maximum velocity parameter V_max that limits the particles to oscillate within specified bounds by preventing the velocity from exceeding it on each dimension d for individual i:

if v_id > V_max then v_id = V_max; else if v_id < −V_max then v_id = −V_max
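The following sketch combines equations 4 and 5 with this clamping rule (numpy assumed; c_1 = c_2 = 2 follows the suggestion cited above, while the V_max value is an arbitrary illustration):

```python
import numpy as np

def pso_step(x, v, p_i, p_g, c1=2.0, c2=2.0, v_max=2.0, rng=np.random.default_rng()):
    """One gbest PSO update for a single particle, with per-dimension clamping."""
    phi1 = rng.uniform(0.0, 1.0, x.shape)   # fresh random coefficients each iteration
    phi2 = rng.uniform(0.0, 1.0, x.shape)
    v = v + c1 * phi1 * (p_i - x) + c2 * phi2 * (p_g - x)   # equation 4
    v = np.clip(v, -v_max, v_max)           # keep oscillations within bounds
    return x + v, v                         # equation 5
```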
In [36], Shi and Eberhart observe that the search process of PSO without the first part of equation 4, i.e. the previous velocity, is not unlike a local search process where the search space statistically shrinks through the generations. From simulations, they deduced that without the first part, the flying particles' velocities are determined only by their current positions and their best positions in history; consequently, all particles tend to move toward the same position. In other words, the search area was observed to contract through the generations, and in these circumstances PSO has little chance of finding the global optimum unless it lies within the initial search space, with the final solution depending heavily on the initial seeds. On the flip side, they deduced that with the first part, particles have a tendency to expand the search space, that is to say they have the ability to search new areas. They concluded that there is a trade-off between global and local search, and that different problems require different balances between local and global search ability. In consideration of this, they introduced an additional term called the inertia weight w to balance the local and global search:
v_i(t + 1) = w v_i(t) + c_1 φ_1 (p_i − x_i(t)) + c_2 φ_2 (p_g − x_i(t))    (6)
The weight w can be a positive constant, or a positive linear or nonlinear function of time. Experiments were conducted to determine the influence of the inertia weight on PSO performance. Schaffer's f6 benchmark problem was adopted, since it is a well known problem whose global optimum is known. For these experiments, PSO was written in C and compiled using the Borland C++ 4.5 compiler. For comparison purposes, all simulations used the same settings except for the inertia weight parameter w, which was varied. The number of particles (population size) was set at a healthy 20 with the maximum velocity set at 2. The particles were defined in the range (−100, 100), meaning particles could not move beyond this range in each dimension. The Schaffer function has a dimension of 2, and the maximum number of iterations allowed was set at 4000; if PSO failed to find an acceptable solution after 4000 iterations, it was deemed to have failed in that particular run. Thirty runs were conducted for each selected value of the inertia weight w, and for each value the average number of iterations required to find the global optimum was calculated.
The results showed that when w is small (< 0.8), if PSO found the global optimum then it found it relatively quickly, which confirmed earlier suggestions that when w is small, PSO behaves more like a local search algorithm. When w was large (> 1.2), PSO took more time to find the global optimum and the chance of PSO failing to find the global optimum increased; in this situation, PSO was found to behave rather like a global search method, always trying to exploit new areas. The authors proposed a medium setting (0.9 < w < 1.2) as having the best chance of finding the optimum, even though it took a moderate number of iterations. It is clear from these experiments that the bigger the inertia weight w, the less dependent the solution is on the initial population, and the more capable PSO is of exploiting new areas.
Shi and Eberhart conclude by espousing some general qualities of a good optimisation algorithm: a more exploratory ability at the beginning, to find a good seed, followed by a more exploitative ability, to fine-search the local area around the seed. Accordingly, they defined the inertia weight w as a decreasing function of time rather than a constant. The inertia weight was programmed to start at a large value of 1.4 and decrease linearly to 0. Experiments showed this version to have even better performance: all 30 runs found the global optimum, and the average number of iterations required to find it turned out to be lower than when w was held above 0.9. It is clear from this study that the inertia weight w can be used to control the search of the search space. Another method for controlling the search involves the use of a pair of constriction coefficients applied to various terms in the formula.
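A one-line sketch of the decreasing schedule described above (the endpoints 1.4 and 0 are those of the reported experiment; the linear form is as stated):

```python
def linear_inertia(t, t_max, w_start=1.4, w_end=0.0):
    """Inertia weight decreasing linearly from w_start to w_end over t_max iterations."""
    return w_start - (w_start - w_end) * (t / t_max)
```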
In [6], Clerc and Kennedy complete a theoretical analysis of the particles' trajectories which leads to a generalized model of the algorithm, containing a set of coefficients to control the system's convergence tendencies. To be precise, the constriction factor analytically chooses values for w, c_1 and c_2 in such a way that control is allowed over the dynamical characteristics of the algorithm, including its exploration versus exploitation abilities. With this method, clamping of velocities is not necessary, as the constriction coefficient ensures particles stay within bounds, changing the velocity update of equation 6 to:
v_i(t + 1) = χ (v_i(t) + c_1 φ_1 (p_i − x_i(t)) + c_2 φ_2 (p_g − x_i(t)))    (7)

with

χ = 2 / |2 − φ − √(φ² − 4φ)|    (8)

where φ = c_1 + c_2, φ > 4.
As the value of φ tends to 4 (from above), the value of χ tends to 1 (from below), reducing the need to damp the velocity, if at all; as φ grows larger, χ tends to zero and the particle's velocity is more strongly damped. A good choice of constriction factor makes velocity clamping unnecessary; however, it has been found elsewhere that χ combined with constraints on V_max significantly improved the performance of the algorithm. An in-depth discussion of this approach is undertaken in subsequent sections.
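A sketch of equation 8 in Python; c_1 = c_2 = 2.05 is an illustrative choice giving φ = 4.1, which matches the value used in the experiments described next:

```python
import math

def constriction(c1=2.05, c2=2.05):
    """Constriction coefficient of equation 8, defined for phi = c1 + c2 > 4."""
    phi = c1 + c2
    if phi <= 4:
        raise ValueError("equation 8 requires phi > 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

print(constriction())   # ~0.7298 for phi = 4.1
```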
In [8], Eberhart and Shi compare the performance of particle swarm optimisation using the constriction factor and the inertia weight. Performance was tested on five well known non-linear test functions: the Sphere, Rosenbrock, Rastrigin, Griewank and Schaffer's f6 function. In all cases the population size was set to 30, with a maximum number of iterations of 10,000. For the inertia weight, a time-varying value was used, set to 0.9 at the beginning of the run and made to decrease linearly to 0.4 at the maximum number of iterations. V_max was set to the maximum range X_max, and each of the two (p − x) terms was multiplied by an acceleration constant of 2.0 (times a random number between 0 and 1). For Clerc's constriction factor, φ was set to 4.1, so the constant multiplier χ is 0.729, and each of the two (p − x) terms was multiplied by 0.729 × 2.05 = 1.49445 (times a random number between 0 and 1). In all cases particles were allowed to fly outside of the region defined by X_max. The results revealed that the constriction method yielded faster results on almost all test functions, with a higher range/average quotient. The authors concluded that the best approach, as a rule of thumb, is to utilise the constriction factor approach while limiting V_max to X_max or, in the alternative, to utilise the inertia weight approach while selecting w, c_1 and c_2 according to equation 8. The results also indicated that improved performance can be obtained by carefully selecting values for the parameters w, c_1 and c_2.
The success of PSO can be seen in yet another variation on the algorithm, presented in [16]. Kennedy noted that the standard particle swarm uses the best information in its topological neighborhood, along with the target individual's best success, to define the center of the particle's oscillations. He noted a paradoxical result when the neighborhood is expanded to include all members of the population, so that all particles are influenced by the best success found by any member of the swarm, i.e. the so-called gbest topology. When this happens, performance is known to suffer on some multimodal functions (multiple peaks), as it has a tendency to misdirect the population toward premature convergence on a local optimum. Kennedy observed several manipulations of the algorithm that affected the centering of a particle's oscillations and noted, crucially, that the amplitude of the cycle had always been determined by the previous iteration's velocity and the distance from the center. Tests were conducted using a sampling distribution strategy to select the next point to test. A distribution of points where p_i and p_g were held at a constant 10 was tried using the standard particle swarm. The results proved that the difference between the individual's and the neighborhood's previous best points was an important parameter for scaling the amplitude of particles' trajectories; step size being a function of consensus.
Kennedy observed that when neighbors' best points are all in the same region of the search space, particles tend to take smaller steps. This is due to the fact that the stochastic mean, which wanders as a result of the random coefficients, always lies between the two bests and never outside their range. Based on the results from figure 6, a barebones version of the particle swarm algorithm was suggested, which does away with the whole velocity formula and instead generates normally distributed random numbers around the following mean on each dimension:

(P_id + P_gd) / 2    (9)

where P_id is the target particle i's previous best success and P_gd is the neighborhood's best success. The standard deviation of the Gaussian normal distribution was scaled using consensus information from the neighborhood, namely |P_id − P_gd|, that is to say the difference on each dimension between the individual's previous best and the best neighbor's previous best position.
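A sketch of this barebones sampling rule (numpy assumed; the whole velocity machinery is replaced by one Gaussian draw per dimension):

```python
import numpy as np

def barebones_sample(p_i, p_g, rng=np.random.default_rng()):
    """Next position: Gaussian centred on the mean of the two bests (equation 9),
    with per-dimension spread |p_i - p_g| as the consensus-scaled deviation."""
    mean = (p_i + p_g) / 2.0
    std = np.abs(p_i - p_g)
    return rng.normal(mean, std)
```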
Tests were performed where versions differed in how g was defined (gbest, neighborhood best, or random neighbor) on five test functions: Schaffer's f6, Griewank, Rastrigin, Rosenbrock and Sphere. The lbest canonical particle swarm was also tested for comparison. The results revealed that while the canonical version converged the fastest on Rastrigin, it by no means outperformed the Gaussian versions. The gbest Gaussian version was observed to converge fastest on the Sphere function, while the random-neighbor Gaussian version was the slowest, even though it performed competitively on most functions, particularly f6, and caught up with the canonical version on the Rastrigin function at around 3000 iterations.
In sum, Kennedy concluded that the entire velocity vector with all its implications, including inertia weights, constriction coefficients and the rest of the machinery, was simply not necessary. Further, as a result of the loss of performance associated with the size of the neighborhood mentioned earlier, Kennedy suggested that it might not even be necessary to identify the best neighbor, even in a localized sociometry. He suggested a version where all neighbors are considered simultaneously, with the caveat that the individual does not influence itself.
In [33], Mendes et al pursue this theme in implementing the Fully Informed Particle Swarm (FIPS). The FIPS model can be considered a generalization of the canonical model: instead of adding two terms to the velocity and dividing the acceleration constant in half to weight each term, FIPS distributes the weight across the entire neighborhood:

v_i ← χ ( v_i + (1/N_i) Σ_{n=1}^{N_i} U(0, φ) (P_nbr(n) − x_i) )    (10)
where N_i is the number of neighbors of particle i and nbr(n) is particle i's n-th neighbor. In the sociometry of the fully informed particle swarm, all neighbors are a source of influence; thus it is the neighborhood size that determines the diversity of influence. Mendes et al noted that in an optimisation algorithm, diverse influences might dilute rather than enhance the quality of the search. As seen previously, the topological structure of populations in PSO is a significant factor in controlling its exploration versus exploitation tendency. In their paper, Mendes et al put it this way: the behaviour of each particle is affected by its local neighborhood, and consequently topology affects search at a low level by defining neighborhoods (that is to say, particles that are acquainted with one another tend to search the same region of the search space) and at a high level by defining relationships between the local neighborhoods.
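A sketch of the update in equation 10 (numpy assumed; the χ and φ values shown are the usual constricted settings quoted earlier, used here only for illustration):

```python
import numpy as np

def fips_step(x, v, neighbor_bests, chi=0.7298, phi=4.1, rng=np.random.default_rng()):
    """Fully informed velocity update: every neighbor's previous best pulls the
    particle, each with a random weight drawn from U(0, phi), averaged over N_i."""
    n = len(neighbor_bests)
    pull = sum(rng.uniform(0.0, phi, x.shape) * (p - x) for p in neighbor_bests)
    v = chi * (v + pull / n)
    return x + v, v
```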
Mendes et al report on tests conducted on FIPS using five different social networks or neighborhoods: the gbest topology, which they refer to as the all topology; the ring topology, most commonly associated with the lbest particle swarm; the four clusters topology, which sociologically resembles four mostly isolated communities where a few individuals have acquaintances outside of their own group, representing four cliques connected among themselves by several gateways; the pyramid topology, which represents a three dimensional wire-frame pyramid and has the lowest average distance of all the graphs and the highest number of first and second degree neighbors; and the square topology, commonly referred to as the von Neumann topology, which can be described as a rectangular lattice that folds like a torus.
In an effort to find a general problem-solver capable of working well with a wide range of problems (NFL notwithstanding), their experiments manipulated neighborhood topologies, initialization strategies and various algorithm details. Both symmetrical and asymmetrical initialization were used: symmetrical initialisation was performed over the entire spectrum of valid solutions, while asymmetrical initialisation started particles with an offset. Based on these variations in settings and parameters, five kinds of algorithms were tested. The traditional canonical particle swarm with Clerc's Type 1 constriction coefficient was tested against wFIPS, a fully informed particle swarm where the contribution of each neighbor was weighted by the goodness of its previous best; wdFIPS, also fully informed, with the contribution of each neighbor weighted by its distance in the search space from the target particle; self, also a fully informed model, where the particle's own previous best received half the weight; and finally wself, another fully informed model where the particle's own previous best received half the weight and the contribution of each neighbor was weighted by the goodness of its previous best.
In total, nine topology conditions were tested, including the five previously mentioned and conditions in which some were tested with and without the inclusion of the target particle in the topological neighborhood. Various criteria were used to measure performance; however, for this project only the performance measurement, which indicates how well a problem-solver is able to do within a limited amount of time, will be considered. The best performer under this measurement was the selfless-square FIPS configuration, while some of the very worst were neighborhoods using the wdFIPS algorithm, which were more than three standard deviations worse than the mean. All in all, the FIPS versions outperformed the canonical particle swarm on every dependent measure; however, the study showed the very worst FIPS conditions were those where the entire population defined the neighborhood, as well as those where the target particle was a member of the neighborhood.
0.3.3 Binary PSO
Particle swarm optimisation was initially developed for continuous valued spaces; however, because many problems are defined for discrete valued spaces where the domain of the variables is finite, Kennedy and Eberhart later developed a discrete binary version of PSO. Examples of these kinds of problems include those which require the ordering or arranging of discrete elements, for instance scheduling or routing problems [19]. In binary space, particles can only take on one of two values at any one time, 1 or 0, true or false, yes or no, and their movement may be seen as flying around the corners of a hypercube by flipping various numbers of bits. The major difference between binary PSO and the continuous version is that the velocity of the particle is better described by the number of bits changed per iteration, or the Hamming distance between the particle at time t and at t + 1. Changes to velocities v_id, trajectories and so on are defined in terms of changes in the probability that a bit will be in one state or the other. As a consequence, this probability must be limited to the range [0, 1], which is easily accomplished with the use of a sigmoid function:
S(v_id) = 1 / (1 + exp(−v_id))    (11)
According to Kennedy and Eberhart, a particle's movement in state space is restricted to zero and one on each dimension, and the velocity v_id represents the probability that the bit x_id will take the value 1. The binary model is similar to the continuous version in that its movement is still a function of personal and social factors, where each particle is adjusted toward its own successes and the successes of the neighborhood:

v_id(t + 1) = v_id(t) + φ_1 (p_id − x_id(t)) + φ_2 (p_gd − x_id(t))

if ρ_id < S(v_id(t + 1)) then x_id(t + 1) = 1; else x_id(t + 1) = 0    (12)

where ρ_id is a uniform random number drawn from [0, 1].
Except that now p_id and x_id are integers in {0, 1}, and the probability S(v_id) is constrained to the interval [0, 1] by the logistic function described in equation 11. As with the continuous version of PSO, the V_max term is retained in binary PSO; however, here it simply limits the probability that the bit x_id will take on a value of 0 or 1. It should be noted that while a higher value for V_max in the continuous version increases the range searched by the particle, the opposite is the case for binary particle swarm: a smaller value for V_max allows a higher mutation rate. Trajectory in binary particle swarm is probabilistic, and the probability of a bit changing is given by:

p(Δ) = S(v_id)(1 − S(v_id))    (13)

which is the absolute, nondirectional rate of change for a bit given a value of v_id. In [19], Kennedy and Eberhart note that the revision of the particle swarm algorithm from continuous to discrete operation may be more fundamental than simple coding changes might imply. For instance, in this model the population members are not viewed as potential solutions but as probabilities, where the value of v_id for each dimension determines the probability that the bit x_id will take on one value or another; the bit x_id itself has no value until it is evaluated. Kennedy and Eberhart tested this model on De Jong's suite of five test functions. The results demonstrated that, apart from the lack of sufficient precision in binary encoding, the binary particle swarm implementation was capable of solving these problems very rapidly.
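A sketch of the discrete update of equations 11 and 12 (numpy assumed; the φ upper limit and V_max values are illustrative):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))   # equation 11

def binary_pso_step(x, v, p_i, p_g, phi_max=2.0, v_max=4.0, rng=np.random.default_rng()):
    """One binary PSO iteration: velocity moves in probability space and each
    bit is resampled against S(v), as in equation 12."""
    phi1 = rng.uniform(0.0, phi_max, x.shape)
    phi2 = rng.uniform(0.0, phi_max, x.shape)
    v = v + phi1 * (p_i - x) + phi2 * (p_g - x)
    v = np.clip(v, -v_max, v_max)        # limits how close to 0 or 1 a bit's probability can settle
    rho = rng.uniform(0.0, 1.0, x.shape)
    return (rho < sigmoid(v)).astype(int), v
```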
Other variations of binary particle swarm have since been implemented. In [35], the authors present a genetic binary particle swarm model introducing birth and death operations and mortality rates to improve the dynamism of the population and to increase the exploratory ability of the particles, consequently varying the size of the swarm. In this method, each particle is considered a binary vector in d-dimensional space, i.e. binary strings or chromosomes of length d. Particle update is carried out in a manner similar to standard particle swarm; however, after each update, new individuals (children) are added to the swarm by birth operations, and other individuals are allowed to die and are removed from the swarm by death operations.
Sadri and Suen begin by explaining that at each generation t the population has a birth rate b(t) ≥ 0 and a death rate (or mortality rate: the number of deaths per unit population per unit time) m(t) ≥ 0. They define the overall rate of the population in terms of these non-negative rates as follows:

r(t) = b(t) − m(t), where ∀t: b(t), m(t) ≥ 0    (14)

The overall rate r(t) of the population at generation t can thus be positive, negative or zero, which affects the size of the population at the next generation relative to its current size as follows:
If r(t) > 0 then P(t + 1) is greater than P(t)
If r(t) = 0 then P(t + 1) is equal to P(t)
If r(t) < 0 then P(t + 1) is less than P(t)
where P(t + 1) is the size of the population at the next generation and P(t) is the current size of the population. They further describe the change in the size of the population in a small time interval from t to t + Δt as directly proportional to the birth and mortality rates:

[Change in population size] = births − deaths

P(t + Δt) − P(t) ≈ b(t) P(t) Δt − m(t) P(t) Δt    (15)
Dividing both sides of the equation by Δt and letting Δt approach zero, they obtain the following differential equation modelling the size of the population across generations:

P'(t) = (b(t) − m(t)) P(t)  ⟺  P'(t) = r(t) P(t)    (16)

with the unique solution:

P(t) = e^{R(t)}, where R(t) = ∫ r(t) dt + c    (17)

where c is a constant depending on the initial size of the population at time t = 0, P(0). They give the birth and mortality rates of the population as:

b(t) = α_1 cos(ω t + θ_1) + β_1, where β_1 ≥ α_1 ≥ 0    (18)

m(t) = α_2 cos(ω t + θ_2) + β_2, where β_2 ≥ α_2 ≥ 0    (19)

According to [35], α_1, α_2, β_1, β_2 and ω are given small positive values and θ_1, θ_2 ∈ [0, 2π], with the condition β_j ≥ α_j guaranteeing that both b(t) and m(t) have non-negative values in all generations.
According to Sadri and Suen, if α_1 = 0 or α_2 = 0 the population has constant birth and mortality rates, which models the standard PSO and is therefore considered a special case of this model. To find the overall rate r(t), the authors substitute b(t) and m(t) from equations 18 and 19 into equation 14 as follows:

r(t) = α cos(ω t + θ) + β    (20)

where β = β_1 − β_2, and α and θ may accordingly be derived from α_1, α_2, θ_1 and θ_2. Replacing r(t) in equation 17, the model for the population is given as:

P(t) = P(0) e^{(α/ω)[sin(ω t + θ) − sin(θ)] + β t}    (21)
Sadri and Suen go on to simulate different situations for the population by varying the α and β parameters. They extrapolate that ω determines the frequency of change in the population size and normally set it at very small positive values. Experiments were conducted to compare the performance of the genetic binary PSO (GBPSO) with the standard binary PSO proposed by Kennedy and Eberhart in [19] on four benchmark functions. The aim was to find the global minimum in both low and high dimensional search spaces, where N = 3, 15, 75 and 150 dimensions. For each function and for each value of N (the dimension of the space), both GBPSO and PSO were run 10 times, each run looping through 80 generations. The GBPSO population started at one particle and gradually increased to a maximum of 397 after 80 generations; PSO was set with a population size of 400 particles and then 800 particles to get a fair comparison. In each run the best particle value found by each algorithm was recorded, along with the average of the best optimum values found by both over the 10 runs. The results showed that GBPSO converged much faster and found better optima than ordinary PSO on the test functions in higher dimensions. Finally, the authors argue that there are biological and experimental arguments explaining why adjusting population sizes in particle swarm is a promising area of research; an assertion that would be difficult to oppose based on their study.
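A sketch of the population-size model of equation 21; the rate parameters here are arbitrary small positives for illustration, not the values used in [35]:

```python
import numpy as np

def population_size(t, p0=1, alpha=0.05, beta=0.02, omega=0.1, theta=0.0):
    """P(t) = P(0) * exp((alpha/omega)(sin(omega t + theta) - sin(theta)) + beta t)."""
    exponent = (alpha / omega) * (np.sin(omega * t + theta) - np.sin(theta)) + beta * t
    return max(1, round(p0 * np.exp(exponent)))

print([population_size(t) for t in (0, 20, 40, 80)])   # a slowly growing swarm
```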
Two main problems and concerns with binary PSO are highlighted in the literature [22]. Firstly, the interpretation of velocity and trajectories gives velocity clamping and the inertia weight a much deeper effect and meaning, which differ substantially from real-valued PSO. For instance, large values for the maximum velocity encourage exploration in continuous PSO, while the opposite is the case for binary PSO, as already noted by Kennedy and Eberhart: in the binary version, small values for V_max promote exploration while large values tend to limit it. There are also difficulties in choosing a proper value for the inertia weight. Values of w < 1 prevent convergence, as v_id decreases to 0 over time and S(0) = 0.5, which means for w < 1 we have lim_{t→∞} S(v_id(t)) = 0.5; if w > 1 then velocity increases over time and lim_{t→∞} S(v_id(t)) = 1, in effect changing all bits to 1.
The second issue is that of memory. In light of equation 12, the next value of a bit is quite independent of its current value; the value is updated mainly using the velocity vector. In continuous PSO, by contrast, the update rule uses the current position of the particle together with the velocity vector to adjust the particle's movement through space. From this, one may deduce that the earlier statement concerning the similarity in the way real-valued PSO and the binary version are updated may not be entirely accurate.
0.3.4 Neighborhood Topologies
Particle swarm can be described generally as a population of vectors whose trajectories oscillate around a region defined by each particle's previous best position and the best position of some other particle. Different methodologies have been used to identify this "some other particle" that influences the individual. Traditionally, two kinds of method or neighborhood topology have been defined for PSO: the gbest and lbest topologies. In the gbest topology, all particles are connected to every other particle in a kind of fully connected neighborhood. The gbest topology was initially useful for applications typically involving finding a matrix of weights for a feed forward neural network. The function landscape of this kind of problem is typically made up of long gradients; however, other problems contain variable interactions and other features that are not typified by smooth gradients. The lbest topology was proposed as a way to deal with these more difficult problems.
In lbest PSO, the ring topology is most commonly used, where each particle has access to its k immediate neighbors in terms of particle indices. The lbest topology offers the advantage that subpopulations can search diverse regions of the problem space. Other topologies are the star topology, where every individual is connected to every other individual, and the wheel topology, in which individuals are isolated from one another and all information is communicated through one focal individual. This focal individual compares the performance of all individuals and adjusts its own trajectory toward the best one.
Researchers have conducted experiments to determine the effect of neighborhood topology on particle swarm performance [15]. In his paper on small worlds and mega-minds, James Kennedy manipulated neighborhood topologies while optimizing four test functions. Kennedy explored a small-world hypothesis, which says that random connections in an otherwise orderly network facilitate the spread of information through a population. He conducted trials on both lbest and gbest particle swarms using various neighborhood topologies, including circles or ring topologies, wheels, stars and random edges. In the reported trials, populations of 20 individuals were implemented using Clerc's constriction coefficient, and the methodology included analysis of variance (ANOVA) to analyze the data. Results revealed a strong interaction between the function and the type of neighborhood. For instance, some populations were seen to perform better on certain functions when in a circle or ring topology than in a wheel configuration, demonstrating quite conclusively that the sociometry of the particle swarm has a significant effect on its ability to find optima.
Neighborhoods model the structure of social networks and provide two kinds of information to the target particle. The first is the locations where neighbors have found relatively good solutions, which represent promising regions of the search space. The second kind of information conveyed by neighborhoods is the distances between particles, which indicate how much consensus there is between particles and determine the size of the particles' steps through the search space. Implementing neighborhoods in standard (gbest) PSO requires modifying the velocity update of equation 4 so that p_g is the best position found so far in the i-th particle's neighborhood, as opposed to the best position found by the entire swarm. The original velocity update equation can be likened to a fully connected network structure, since every particle is attracted to the best solution found by the entire swarm population. The primary purpose of neighborhoods is to preserve diversity within the swarm by impeding the flow of information through the network.
In Kennedy and Eberhart 2001, comparisons of lbest and gbest canonical particle swarms revealed a difference in convergence speed and in the ability to search over local optima. Kennedy in [20] examined the effect of varying sociometric configurations in gbest and lbest, including the von Neumann topology, where k was defined with self and without self, on a suite of five standard test functions in 30 dimensions (except f6). It was found that varying k significantly affected the performance of the algorithm on all three dependent measures; for instance, in the first experiment, k = 5 conditions had the best values at 1000 iterations (Stand. Perf.) and required the fewest iterations to meet the criteria (Iter.), while k = 3 had the highest success rate (Prop.). In general, the worst performer was the gbest configuration without self included; the gbest version with self performed only marginally better on some measures. The lbest populations with self were observed to be slow and inaccurate, and performed only somewhat better when the self was removed from the configuration. The von Neumann topology was found to be more consistent over all experiments than topologies commonly found in current practice. The von Neumann topology comprises a kind of square topology where the population is arranged in a rectangular matrix, with each individual connected to the particles above, below and to either side, wrapping around the edges. The study suggested that while populations with greater connectivity speed up convergence, there is no evidence that this improves the population's ability to discover the global optimum.
Other studies have been carried out to test the effect of sociometrics on particle swarm. In [11], van den Bergh tested the Guaranteed Convergence PSO (GCPSO) and canonical PSO using gbest, lbest and von Neumann topologies against some standard unimodal and multimodal functions. GCPSO was introduced by van den Bergh to address the issue of premature convergence to solutions that are not guaranteed to be local extrema. The modification to canonical PSO involves replacing the velocity update equation of the best particle with the following:

v_ij(t + 1) = w v_ij(t) − x_ij(t) + p_ij + ρ(t) r_j    (22)
where r_j is a uniform random number sampled from (−1, 1) and ρ(t) is a scaling factor, determined in the reported experiments using the standard sphere function:

f(x) = Σ_{i=1}^{n} x_i²    (23)

Here ρ(0) = 1.0, and ρ(t + 1) is set to 2ρ(t) if the number of successes is greater than s_c, to 0.5ρ(t) if the number of failures is greater than f_c, and to ρ(t) otherwise, where s_c and f_c are tunable threshold parameters.
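A sketch of this adaptation rule; the threshold values are placeholders, since the description above only states that they are tunable:

```python
def update_rho(rho, successes, failures, s_c=15, f_c=5):
    """GCPSO scaling factor: double after sustained success, halve after sustained failure."""
    if successes > s_c:
        return 2.0 * rho
    if failures > f_c:
        return 0.5 * rho
    return rho
```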
Performance of GCPSO and canonical PSO varied depending on the test function; however, the two lbest configurations were generally judged the most unstable, as evidenced by their high standard deviations. An interesting result from this experiment is that the gbest configurations were observed to maintain a much steeper descent in error value even after other configurations had flattened out. It was also noted that GCPSO appears not to benefit from neighborhoods to the same extent as the standard PSO; however, GCPSO did not appear to be more adversely affected by neighborhoods than the standard (canonical) PSO on unimodal functions. The GCPSO using the gbest topology seemed to outperform canonical PSO overall. The Fully Informed Particle Swarm (FIPS) was tested in [21] against canonical PSO by varying the neighborhood topology configuration. In canonical PSO, velocity is adjusted by an amount determined by the weighted difference between the individual's previous best and the neighborhood's previous best positions; in FIPS, the velocity is adjusted by an amount that is a kind of average difference between each neighbor's previous best and the target particle's current position. Canonical PSO was found to perform better over 1000 iterations, with results on average nearly half a standard deviation below the mean. FIPS performed very badly over 1000 iterations in k = 10 conditions but better than the canonical version in k = 3 conditions, while the canonical algorithm performed evenly across all conditions. It was noted that FIPS turned out to be more susceptible to alterations in topology; however, it was found to outperform the canonical version given a good topology.
0.3.5 Particle Swarm Trajectories
A large body of research has been conducted to study and improve the performance of PSO since its first publication. From these studies, researchers have invested much effort in the attempt to gain a better understanding of the convergence properties of the particle swarm algorithm. Many of these studies have taken the approach of first simplifying the algorithm by looking at the movement of one particle, perhaps in a single dimension, and then generalizing to multiple particles in multiple dimensions. In order to understand the movement of PSO, the following terminology is adopted [28][6]:

x(t) = (x_1(t), ..., x_m(t)) is a population of m particles in a multidimensional space at time step t.

x_i(t) = (x_i1(t), ..., x_id(t)) is the i-th particle in d-dimensional space at time step t.

v_i(t) = (v_i1(t), ..., v_id(t)) is the velocity of x_i in d-dimensional space.

N is the neighborhood of particle x_i, defined in the context of a predefined neighborhood topology.

p_i(t) is the previous best (local best) position of particle x_i at time step t, such that f(p_i(t)) ≤ f(p_i(t − 1)).

p̂_i(t) (the neighborhood or global best) is the best position known to particle x_i at time step t, such that p̂_i(t) ∈ N and f(p̂_i(t)) ≤ f(p_i(t − 1)) for all p_i ∈ N.

From the above, x_i moves according to the following equations:
v_i(t) = v_i(t − 1) + c_1 r_1 (p_i(t) − x_i(t − 1)) + c_2 r_2 (p̂_i(t) − x_i(t − 1))    (24)

x_i(t) = x_i(t − 1) + v_i(t)    (25)

where c_1 and c_2 are two positive constants and r_1 and r_2 are uniform random numbers in the range [0, 1].
At this point a one-dimensional version of the system is considered, where p_i(t) = p̂_i(t) = p by simplification. Thus equations 24 and 25 become:

v_i(t) = v_i(t − 1) + (φ_1 + φ_2)(p − x_i(t − 1))    (26)

x_i(t) = x_i(t − 1) + v_i(t)    (27)

where φ_1 = c_1 r_1 and φ_2 = c_2 r_2.
Ozcan and Mohan theorize that for the case where φ_1, φ_2 and p are constants, and imposing the initial conditions v(0) = v_0 and x(0) = x_0, the system can be further simplified; a particle's behavior is then defined by the following equations:

v(t) = v(t − 1) − φ x(t − 1) + φ p    (28)

x(t) = x(t − 1) + v(t)    (29)

where φ = φ_1 + φ_2. By a simple substitution, they obtain the following recursion from equations 28 and 29:

x(t) = (2 − φ) x(t − 1) − x(t − 2) + φ p    (30)

with the initial conditions set at x(0) = x_0 and x(1) = x_0 (1 − φ) + v_0 + φ p.
Ozcan and Mohan utilize generating functions and methods for solving non-homogeneous linear recurrence equations to obtain the following closed form:

x(t) = α ((2 − φ + δ)/2)^t + β ((2 − φ − δ)/2)^t + p    (31)

where

δ = √(φ² − 4φ)    (32)

β = (x_0 − p)(δ + φ)/(2δ) − v_0/δ    (33)

α = x_0 − p − β    (34)
According to Ozcan and Mohan, it thus becomes possible to determine the trajectory of an isolated particle whose personal best is the same as its global best. If x_0 = p = 0, they give the trajectory equation as:

x(t) = (v_0/δ) (((2 − φ + δ)/2)^t − ((2 − φ − δ)/2)^t)    (35)

δ = √(φ² − 4φ)    (36)

δ is a positive real number if φ² − 4φ > 0; otherwise it is a complex number. The trajectory can therefore be analyzed in two domains, δ ∈ ℝ (real) and δ ∈ ℂ (complex).
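The two regimes are easy to observe numerically. Below is a sketch iterating the recursion of equation 30 under the stated assumptions (x_0 = p = 0, v_0 = 1):

```python
import cmath

def trajectory(phi, p=0.0, x0=0.0, v0=1.0, steps=8):
    """Iterate x(t) = (2 - phi) x(t-1) - x(t-2) + phi*p (equation 30)."""
    xs = [x0, x0 * (1 - phi) + v0 + phi * p]   # the stated initial conditions
    for _ in range(steps - 2):
        xs.append((2 - phi) * xs[-1] - xs[-2] + phi * p)
    return xs

print(cmath.sqrt(3.0 ** 2 - 4 * 3.0))   # imaginary delta: the oscillatory regime
print(trajectory(3.0))                  # bounded, periodic samples
print(trajectory(4.5))                  # real delta: amplitude grows exponentially
```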
Real case
For the cases φ = 0, φ = 4 and φ > 4, δ is a real number.
Case φ = 0: Ozcan and Mohan consider this a special case producing the following recursive trajectory equation:

x(t) = 2x(t − 1) − x(t − 2)    (37)

solving the equation to get:

x(t) = (x_0 − p) + v_0 t    (38)

Ozcan and Mohan argue that, given x_0 = p = 0, the particle will move in the initial direction with the initial velocity indefinitely.
Case φ = 4: The recursive trajectory equation is as follows:

x(t) = −2x(t − 1) − x(t − 2) + 4p    (39)

making for a closed form of:

x(t) = ((x_0 − p) + (2(x_0 − p) − v_0) t)(−1)^t + p    (40)
According to Ozcan and Mohan, given x_0 = p = 0, the particle will be seen to move at consecutive time steps in opposite directions with a speed that increases in proportion to the initial velocity (x(t) = −v_0 t (−1)^t).
Case φ > 4: Going by equations 35 and 36, the trajectory is described by Ozcan and Mohan as an oscillatory graph bounded by exponential envelopes. They advance the position that the steepness of the graph is determined by the initial velocity along with the other parameters, and that increasing the initial velocity causes the envelope to diverge quickly, in turn causing the step size of the particle to increase, allowing for a larger search of the search space. Ozcan and Mohan point out that when a particle is in a region where δ is real, it increases its step size exponentially at each time step, and remedial action may be necessary to keep the particle within bounds.
Complex case
They further theorise that δ is a complex number when 0 < φ < 4. The trajectory equation from equations 31 to 34 can be rewritten as follows:

x(t) = α z_1^t + β z_2^t + p    (41)

where

z_1 = (2 − φ + δ)/2    (42)

z_2 = (2 − φ − δ)/2    (43)

β = (x_0 − p)(δ + φ)/(2δ) − v_0/δ    (44)

α = x_0 − p − β    (45)

δ = i √(|φ² − 4φ|)    (46)

If x_0 = p = 0, they simplify the equation as follows:

x(t) = (2v_0 / |δ|) sin(atan(|δ| / (2 − φ)) t)    (47)

They determine that the trajectory of a particle in the complex domain follows a sinusoidal wave whose amplitude and frequency are determined by the choice of parameters; that is, the choice of parameters determines the direction and step size of the particle, and the domain can be divided into subregions depending on the value of φ, as described below.
Case 0 < φ ≤ 2 − √3 ≈ 0.268: Ozcan and Mohan work out that as φ decreases, the amplitude of the sine wave will increase, since |δ| becomes less than 1; the wave frequency will also be relatively smaller, making the period larger.
Case φ = 2: the trajectory equation becomes:

x(t) = v_0 sin(πt/2)    (48)

They note that for φ < 2 the amplitude of the sine waves will be positive, and for φ > 2 the sine waves will lag behind.
Cases 2 − √3 < φ < 2 and 2 < φ < 2 + √3 ≈ 3.732: in this instance, they argue that the amplitude of the sine wave will be approximately v_0, since |δ| is bounded by a maximum of √5 in magnitude.
Case 2 + √3 < φ < 4: the amplitude of the sine wave increases with φ; as in the first case, |δ| becomes less than 1, reducing the frequency, which is inversely proportional to the period of a sine wave.
Ozcan and Mohan conclude that particle trajectories follow periodic sinusoidal waves and that an optimum is found by randomly catching another wave and manipulating its frequency and amplitude [37]. These results have been confirmed by others in empirical studies, not least Kennedy 1998a.
In [6], Clerc and Kennedy apply a similar methodology to provide a theoretical analysis that ensures particle convergence to a stable point in constricted trajectories. From equation 24, where a particle's velocity is adjusted by φ(p_i(t) - x_i(t - 1)), p_i(t) being the best position found so far by the individual particle in the first term, or by any neighbor in the second term, at time step t, Clerc and Kennedy simplify the formula by redefining p_i(t) as follows:
p_i = (φ_1 p_i + φ_2 p_g)/(φ_1 + φ_2)   (49)

where p_i on the right-hand side is the personal best and p_g the neighborhood best,
and then simplify further to a particle whose velocity is adjusted by just a single term
v_id(t + 1) = v_id(t) + φ(p_id - x_id(t))   (50)
again, where φ = φ_1 + φ_2 and substituting p_id for p_i. Clerc and Kennedy begin by considering a 1-dimensional deterministic particle with a constant p:
v(t + 1) = v(t) + φ(p - x(t))   (51)

x(t + 1) = x(t) + v(t + 1)   (52)

where p and φ are constants. Thus a system of two first-order recurrence relations describing the particle was built, where y(t) = p - x(t):

v_{t+1} = v_t + φ y_t   (53)

y_{t+1} = -v_t + (1 - φ) y_t   (54)
They expressed the system as a matrix as follows:

P_t = [ v_t
        y_t ] = M · P_{t-1} = M^t · P_0   (55)

M = [  1       φ
      -1    1 - φ ]   (56)
and the eigenvalues of the system were derived thus [14]:

c_i = 1 - φ/2 ± √(φ^2 - 4φ)/2 = 1 - φ/2 ± δ/2,   i = 1, 2   (57)
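As a quick numerical check (an illustration assumed for this report, not taken from [6]), the eigenvalues MATLAB returns for the matrix M of equation 56 can be compared against the closed form of equation 57; for 0 < φ < 4 both are complex with magnitude 1, which is why the unconstricted deterministic system oscillates without converging.

phi = 3.0;
M = [1, phi; -1, 1 - phi];
eig_numeric = eig(M);                                    % numerical eigenvalues
eig_closed  = 1 - phi/2 + [1; -1]*sqrt(phi^2 - 4*phi)/2; % equation 57
disp([eig_numeric, eig_closed]);   % the columns agree, up to ordering
disp(abs(eig_numeric));            % both magnitudes equal 1 here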
By diagonalizing, Clerc and Kennedy proved that the position of a particle depends on its eigenvalues raised to the power of the time step and on the initial conditions. They worked out that if at least one eigenvalue has a magnitude not smaller than one, then the system does not converge. In such cases, they propose a system of surrogate values whose eigenvalues c'_1 and c'_2 are smaller than one in magnitude. To achieve this, five coefficients (α, β, γ, δ, η) are added to the system, whose values can be chosen to ensure convergence. The system is depicted below:
v_{t+1} = α v_t + βφ y_t
y_{t+1} = -γ v_t + (δ - ηφ) y_t   (58)

M' = [  α        βφ
       -γ    δ - ηφ ]   (59)
Note that the δ in equations 58 and 59 is a coefficient and is different from the δ of equation 57. Thus, if the system in equation 55 does not attain the threshold of both eigenvalues being smaller than one in magnitude, constricted coefficients are applied, whereby the eigenvalues of the surrogate system are forced to have magnitudes smaller than one. Clerc and Kennedy studied two different constriction types; however, this report will only consider a derivative of Type 1 (Type 1'').
Type 1'' Constriction With this type of constriction, for the particular case where δ = 1, the added coefficients are correlated as α = β = γ = η = χ, and the system matrix and eigenvalues become:

M' = [  χ        χφ
       -χ    1 - χφ ]   (60)

c'_i = (1 + χ - χφ ± √((χφ)^2 - (2χ + 2χ^2)φ + (χ - 1)^2)) / 2   (61)

(For plain Type 1 constriction, where all five coefficients equal χ, M' = χM and the eigenvalues are simply scaled: c'_i = χ c_i.)
According to Clerc and Kennedy, if the eigenvalues of the original system are complex conjugates or are real-valued, then the same will apply to those of the constricted one. However, this is not the case for Type 1'' constriction. The values of φ (depending on χ) for which the discriminant of equation 61 is negative or equal to zero are given by

φ_max,min = (√(1/χ) ± 1)^2   (62)
Clerc and Kennedy theorise that if the eigenvalues of the constricted system are complex conjugates, then their magnitudes are given by √χ; consequently, convergence is ensured by the simple condition χ < 1. In other words, enforcing both conditions ensures convergence. They propose using the following to calculate the constriction factor that ensures convergence:
χ = 2κ/(φ - 2 + √(φ^2 - 4φ))   if φ > 4
χ = κ                          if 0 < φ ≤ 4,     with 0 < κ < 1   (63)
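A minimal sketch of this calculation (mirroring, but not copied from, the constriction computed in Appendix B) takes κ = 1 and an assumed φ > 4, and confirms that the constricted system matrix of equation 60 has eigenvalue magnitudes below one:

kappa = 1;
phi   = 4.1;                                      % phi > 4 branch of equation 63
chi   = 2*kappa/(phi - 2 + sqrt(phi^2 - 4*phi));  % approx 0.7298
M_c   = [chi, chi*phi; -chi, 1 - chi*phi];        % constricted matrix, equation 60
disp(chi);
disp(abs(eig(M_c)));                              % both magnitudes approx 0.854 < 1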
In [14], Innocente and Sienz validate these theoretical computations, calculating that if φ ≤ 4, the discriminant is < 0 for φ_min < φ < φ_max; in this case they easily compute equation 62, since κ = χ. Further, they also theorize from equation 62 that φ_min < 4 for φ < 4 and 1/9 < χ < 1, as shown in equation 64:
if φ < 4 and 1/9 < χ < 1,   then   0 < φ_min < 4 and φ_max > 4   (64)
According to [6][14], if φ = 4, κ equals χ and therefore there is continuity in the curves. However, for the case of φ > 4, where the discriminant is < 0 for 4 < φ < φ_min, with φ_min as in equation 62, Innocente and Sienz admit that in this instance the calculation of φ_min is not as straightforward, because the value of φ_min increases with χ. Referring to Clerc and Kennedy's computation of values for φ_max, the upper bound for the discriminant < 0 (which they referred to as φ_min as a result of being calculated in the negative), φ_min = 8.07 for χ = 0.40 and φ_min = 39799.76 for χ = 0.99. According to Innocente and Sienz, the calculation of χ is no longer straightforward because it no longer represents the ratio between the magnitudes of the corresponding eigenvalues of the original and the surrogate systems. The PSO update equation for this type of constriction is given as:
x_t = x_{t-1} + v_t,   where   v_t = χ((x_{t-1} - x_{t-2}) + φ(p - x_{t-1}))   and   x_{t-1} - x_{t-2} = v_{t-1}   (65)
To ensure convergence, it is normal to replace φ with φ_max, since φ is a random variable, implying that every generated φ < φ_max will be constricted more strongly than necessary. Generalizing for coefficients that may differ for different particles and change over time, Innocente and Sienz give the update equation for the Type 1'' Constricted Original PSO (COPSO) as:
x_ij^(t) = x_ij^(t-1) + χ_i^(t) (x_ij^(t-1) - x_ij^(t-2)) + iw_i^(t) U(0,1) (pb_ij^(t-1) - x_ij^(t-1)) + sw_i^(t) U(0,1) (lb_ij^(t-1) - x_ij^(t-1))   (66)
where

χ_i^(t) = 2κ_i^(t)/(φ_maxi^(t) - 2 + √((φ_maxi^(t))^2 - 4φ_maxi^(t)))   if φ_maxi^(t) > 4
χ_i^(t) = κ_i^(t)   otherwise   (67)

with κ_i^(t) ∈ (0, 1).
Innocente and Sienz easily reduce this to the Constricted PSO (CPSO) by using the constriction factor as the constricted inertia weight (w_c) and the constricted acceleration coefficient (φ_c) in place of the original φ, as shown in the following equation:

w_c = χw = χ,   φ_c = χφ   (68)
In essence, this constriction works by scaling down the original coefficients, and it ensures convergence if the pair of constricted coefficients (φ_c, w_c) falls within a particular convergence triangle. According to Clerc and Kennedy, the use of constriction coefficients can be viewed as a recommendation to take smaller steps; convergence is toward the point (v = 0, x = (φ_1 p_1 + φ_2 p_2)/(φ_1 + φ_2)), i.e. v being the velocity and equal to 0 at the point of convergence.
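A single CPSO update step can therefore be sketched as follows (a hedged illustration: the variable names and initial values are assumptions, and φ is split evenly between the personal and local attractors, as with the ac1 = ac2 = 2.05 setting used later in this report):

phi = 4.1;
chi = 2/(phi - 2 + sqrt(phi^2 - 4*phi));   % constriction factor, equation 63
w_c   = chi;                               % constricted inertia weight, equation 68
phi_c = chi*phi;                           % constricted acceleration coefficient
x = [0.5, -1.2];  v = [0.1, 0.1];          % one 2-dimensional particle
pbest = [0, 0];   lbest = [0.2, -0.1];     % assumed personal and local bests
v = w_c*v + (phi_c/2)*rand(1,2).*(pbest - x) + (phi_c/2)*rand(1,2).*(lbest - x);
x = x + v;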
Other researchers have studied the trajectory of PSO experimentally. In [34], Janson and Middendorf investigate the moving behavior of a single particle on the two-dimensional Sphere function with very specific measurements. They observed a bias whereby most movement steps occurred parallel to one of the coordinate axes. When they repeated the experiment with 5 particles, they found that most of the particles end up at a position next to one of the coordinate axes and that the major directions of movement are parallel to the respective other axis. On this basis they theorize that the optimisation behavior of PSO changes with rotations of the optimisation function around the point of origin, as a consequence of the bias toward the coordinate axes observed earlier. A rotatable function is defined to illustrate this behavior, and a side-step mechanism was developed to combat the bias. Eventually, both standard PSO and side-step PSO were tested on a suite of standard benchmark functions. For all tests, it was found that the gbest PSO implementing the side-step technique performed better than the ordinary standard PSO. The same applied to the lbest version, which outperformed the standard lbest on all but one function. They concluded their experiments by identifying two potential weaknesses in the movement behavior of standard PSO: first, a distinct bias in movement direction influenced by the direction of the coordinate axes, and second, the unwanted effect that the optimisation behavior of standard PSO is not invariant to rotations of the optimisation function.
0.3.6 Scalability
Few studies on the scalability of particle swarm are readily available in the literature. In [27], Piccand et al., motivated by the applicability of particle swarm in the domain of sound synthesis, conduct a series of experiments to examine whether dimensionality is a real problem for PSO. Their study focused on standard lbest and gbest PSO. The gbest model was implemented using a linearly decreasing inertia weight, whereas the constriction factor was retained for the lbest model. Further, in both models velocity was limited to the size of the search space, where v_max = x_max, and the search space was of the order [-x_max, x_max]^N. The aim was to understand the effect of increasing the population size and/or the length of the optimisation process on the final quality of the solution.
To compare the algorithms, four standard benchmark functions were used: Ackley, Griewank, Rastrigin and Rosenbrock. Each experiment consisted of a varying number of problem sizes between 30 and 200, while problems of dimensions 300, 400 and 500 were also used on Ackley and Griewank. The swarm size was also varied between a population of 25 and 500. Each experiment was run 50 times with 10,000,000 evaluations of the benchmark functions. On Ackley, the results showed the gbest model to be successful at solving problems of size equal to or below 200 dimensions regardless of the swarm size used. For problems of size 300 or more, the gbest configuration was found to fail more than 50% of the time. The lbest model was found to be even less efficient on the Ackley function, as it was unable to solve a problem size of over 100 dimensions, and even 75 in some cases. However, it was able to solve problems of a smaller size faster, in some cases, than the gbest model. On the Griewank function, the gbest model did not have any scalability issues for problem sizes up to 500 dimensions. It was found that the number of evaluations had to be increased quite regularly as the problem size increased, and the maximum of 10,000,000 evaluations was insufficient when a large swarm size was used. The performance of the lbest model on Griewank was found to be quite good, as it had a 100% success rate except for a swarm size of 25, where the success rate was found to decrease quickly for a problem size of 100 dimensions. The lbest model performed better than the gbest model on Griewank and did not have many scalability issues once the population size was large enough. A population size of 50 was found to be a good compromise for problem sizes of 500 dimensions and below.
Both models were found to have scalability issues on the Rastrigin function. The gbest model failed to find a solution more than 50% of the time for a problem size of 100 or above, except when the population size was increased to about 300. The lbest model performs even worse on this function, as it fails all the time for a problem size of 75 dimensions and above, regardless of the population size used. The lbest model was found to perform much better than the gbest on the Rosenbrock function, particularly when using the smallest swarm sizes. Experiments showed that the number of evaluations required to solve problems of size 100 increased exponentially for both models. On the basis of their findings, the authors conclude that scalability is an issue for the PSO and that, to solve a problem of increasing size, one may find it necessary to increase the swarm size and run more iterations.
Another variant of PSO, NichePSO, has been scaled successfully to higher dimensional domains [38]. Niching techniques are modelled after a natural phenomenon whereby species favour different environments based on individual needs, resulting in several species co-existing in a macro-ecology. A more familiar description is to say they attempt to overcome deficiencies of unimodal optimisation techniques by explicitly assuming that multiple solutions may exist in the search space. In the text just cited, the performance of NichePSO in higher dimensional domains is compared to two genetic algorithm niching techniques, sequential niching (SN) and deterministic crowding (DC). NichePSO was found to effectively handle large numbers of solutions and was not in any way hindered by an increase in problem dimensionality, in contrast to SN and DC, which were found not to perform well.
0.4 Experimentation
As previously noted, although the difficulty of a problem generally increases with its dimensionality, it is natural that some high-dimensional problems are easier than others. For instance, if the decision variables involved in a problem are independent of each other, one may easily solve the problem by decomposing it into a number of sub-problems, each of which involves only one decision variable while treating the rest as constants. This class of problems is known as separable problems and has been formally defined in [32] as follows:
Definition 1 A function f(x) is separable if

arg min_{(x_1,...,x_n)} f(x_1, ..., x_n) = ( arg min_{x_1} f(x_1, ...), ..., arg min_{x_n} f(..., x_n) )   (69)
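To make the practical consequence of Definition 1 concrete, the following MATLAB sketch (an assumed illustration, not part of the experiments below) minimises a separable function one coordinate at a time with a crude line search, holding the remaining variables constant:

f = @(x) sum(x.^2, 2);        % separable: f(x) = sum_i x_i^2, evaluated row-wise
n = 5;
x = 10*rand(1, n) - 5;        % random start in [-5, 5]^n
pts = linspace(-5, 5, 1001);  % candidate values for a single coordinate
for i = 1:n                   % solve n independent 1-dimensional sub-problems
    trial = repmat(x, numel(pts), 1);
    trial(:, i) = pts';
    [~, k] = min(f(trial));
    x(i) = pts(k);
end
disp(x);                      % approximately the zero vector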
The aim of this experiment is to test the hypothesis that the scaling of the particle swarm is of linear complexity on a class of separable test functions. The constricted-coefficient particle swarm [6] is implemented and tested against two separable test functions, whose properties are defined as follows [5]:
F_1: Shifted Sphere Function

F_1(x) = Σ_{i=1}^{D} z_i^2 + f_bias_1,   z = x - o,   x = [x_1, x_2, ..., x_D]   (70)

D: dimensions. o = [o_1, o_2, ..., o_D]: the shifted global optimum.

Figure 1: Shifted-Sphere

Properties
Unimodal
Shifted
Separable
x ∈ [-100, 100]^D
Global optimum: x* = o, F_1(x*) = f_bias_1 = -450
F_4: Shifted Rastrigin's Function

F_4(x) = Σ_{i=1}^{D} (z_i^2 - 10cos(2πz_i) + 10) + f_bias_4,   z = x - o,   x = [x_1, x_2, ..., x_D]   (71)

D: dimensions. o = [o_1, o_2, ..., o_D]: the shifted global optimum.

Figure 2: Shifted-Rastrigin

Properties
Multimodal
Shifted
Separable
x ∈ [-5, 5]^D
Global optimum: x* = o, F_4(x*) = f_bias_4 = -330
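Both benchmarks are straightforward to state in code. The sketch below (a standalone illustration following equations 70 and 71, not the benchmark_func.m of the appendix, with an assumed shift vector o kept inside [-5, 5] so that it is valid for both domains) verifies the optima:

D = 30;
o = 10*rand(1, D) - 5;                    % assumed shifted global optimum
f1 = @(x) sum((x - o).^2) - 450;          % shifted Sphere, f_bias1 = -450
f4 = @(x) sum((x - o).^2 - 10*cos(2*pi*(x - o)) + 10) - 330;   % shifted Rastrigin
disp(f1(o));                              % -450 at x* = o
disp(f4(o));                              % -330 at x* = o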
0.4.1 Method
The Sphere function is run in 10, 20, 30, 40, 50, 60, 80 and 100 dimensions, while the Rastrigin function is run in 30, 50 and 100 dimensions. The maximum number of evaluations is set to 5000 times the number of dimensions, except in 30 dimensions, where the maximum number of evaluations is set to 300,000. The error value is set to 10^-8 and the global optimum is shifted according to the bias in each objective function definition. Both functions are run 50 independent times in each dimension, and the average number of function evaluations/best final value is calculated using the equation:
x̄ = (1/n) Σ_{i=1}^{n} x_i   (72)

where x̄ represents the mean, x_i is the i-th d-dimensional observation, and n is the total number of observations in the population. The standard deviation is the square root of the variance σ^2, given as:

σ^2 = (1/n) Σ_{i=1}^{n} (X_i - X̄)^2   (73)

where σ^2 is the variance and the standard deviation σ = √(σ^2). The standard error is therefore the standard deviation divided by the square root of the total number of observations, σ/√n.
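For completeness, the three statistics reduce to a few lines of MATLAB; the vector of per-run evaluation counts below is hypothetical data used only to show the arithmetic of equations 72 and 73:

x_runs  = [19923, 20105, 19744, 20010];   % hypothetical per-run counts
n       = numel(x_runs);
x_mean  = sum(x_runs)/n;                  % equation 72
sigma2  = sum((x_runs - x_mean).^2)/n;    % equation 73 (population variance)
sigma   = sqrt(sigma2);
std_err = sigma/sqrt(n);
fprintf('mean %.1f, std %.1f, std err %.1f\n', x_mean, sigma, std_err);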
PSO with constriction factor is implemented using asynchronous update in MATLAB on a 32-bit 2.00 GHz system. A simple design methodology implements the model in the following function files (see appendix):
Init_lbest.m
This is the main initialisation file, which sets variables such as the population size, neighborhood size and acceleration constants, as well as the center and range of the search space, which is determined by the test problem in use at any one time. It also calculates the constriction factor once at the start of every run. This file is called from within the main program at the start of each run, when it randomly initialises the particles within the search space. It takes the objective function name as an argument and returns a matrix containing the positions of the particles and their current velocities.
l_best_async.m
The main program file, which takes two arguments: a string representing the name of the objective function and an integer representing the number of dimensions. As well as setting other variables, this function contains the main program loop, which can be described as layered: two while loops are implemented, one running within the other. The outer loop controls the number of runs or trials using two variables: int_trials, which is incremented at the start of every run, and num_trials, which represents the total number of trials. The inner loop controls the maximum number of evaluations to termination, implemented using a max_FEs variable and an FE_counter variable. A for loop is further nested within the inner loop, allowing each particle to be evaluated asynchronously. This function calls the Init_lbest function to initialise all particles, then calls the objective function once within the outer loop and then iteratively within the nested for loop. It checks for convergence before updating particle positions. Finally, it calls a routine to save the results.
benchmark_func.m
This function implements the benchmark functions (Sphere and Rastrigin) as given in [5]. It takes a matrix representing particle positions and an integer identifying the objective function as arguments. Depending on the objective function, this routine performs a series of transformations on the data by shifting the global minimum according to a bias.
save_trial_data.m
A simple routine to save each trial's data (i.e. the number of function evaluations and the best final value) to file at the end of each trial, which could be when the program achieves stable convergence or when it reaches the maximum number of evaluations.
normmat.m
A utility that takes a matrix and forces the data to fit within a new range.
forcecol.m
A utility function to force a vector into a single column.
0.4.2 Results
PSO with constriction coefficient is able to find the global minimum of the Sphere function 100% of the time in 10, 20, 30 and 40 dimensions. The graph below shows the average number of evaluations increasing as dimensionality increases. In 50 dimensions, PSO fails to find the global optimum only 4 times, a success rate of over 90%. However, its success rate deteriorates rapidly as dimensionality increases above 50, falling to less than 40% in 60 dimensions. PSO finds the global optimum only 10 times in 80 dimensions and, in 100 dimensions, fails to find the global optimum in all but two of the 50 runs.
Table 1: F1 - Function Evaluations Results
Func n No of Runs Mean Std Dev Std Err
f1 10 50 19923 1055 149.3
f1 20 50 41557 2013 284.6
f1 30 50 68646 7677 1085.7
f1 40 50 106145 17039 2409.6
f1 50 50 157025 49285 6969.9
f1 60 50 244855 60156 8507.4
f1 80 50 362231 69514 9830.7
f1 100 50 490257 48562 6867.7
The standard deviation values in Table 1 peak at 80 dimensions, after which they decline at 100 dimensions. This gives an idea of the characteristic bell-shaped curve of the normal distribution. It is noted here that, though the average number of evaluations increases with problem dimensionality, the performance of PSO on problems in larger dimensional domains of 50 dimensions and above does not easily lend itself to the theory of linear complexity in the scaling of PSO.
The Rastrigin function is a heavily multimodal function, and PSO was unable to find the global optimum in any dimension. This is consistent with the results obtained in [27]. Table 2 shows the best final value results for PSO in 30, 50 and 100 dimensions. Even though the maximum number of evaluations was set to 300,000 in 30 dimensions, PSO was still unable to find the global minimum. Time and again, particularly in the higher dimensions, PSO was observed to become trapped in local minima for sustained periods.
Table 2: F4 - Best Final Value Results
Func n No of Runs Mean Std Dev Std Err
f4 30 50 -186 46 6.5
f4 50 50 42.98 109.4 15.47
f4 100 50 707.7 375.81 53.14
Table 3: F4 - Function Evaluation Results
Func n No of Runs Mean Std Dev Std Err
f4 30 50 300024 13.42 1.898
f4 50 50 250018 11.95 1.690
f4 100 50 500017 12.28 1.737
The graphs below, depicting the average best final value and the average number of evaluations for the Rastrigin function, appear to be quite revealing. While the average best final value appears to scale linearly, the average number of function evaluations does not. The reader may recall that the maximum number of evaluations was set to 5000 times the number of dimensions in all dimensions except 30, where it was set at a steady 300,000. The question of a symbiotic relationship between the best final value and the maximum number of evaluations may be considered overly simplistic by far more experienced minds; nevertheless, it does give one pause for thought.
0.4.3 Conclusion
This study has found the particle swarm algorithm to be an effective optimiser on the unimodal Sphere function of low dimensional complexity, i.e. 1-30 dimensions. Experiments showed the average number of function evaluations scales linearly up to 40 dimensions with a 100% success rate. In higher dimensions of 50 and above, the success rate of PSO was observed to decrease rather rapidly, even as the average number of evaluations continued to increase. For the multimodal Rastrigin function, PSO performed extremely poorly, as it was unable to find the global optimum in any dimension; it was observed to become trapped in local minima for sustained periods. The average best final values on Rastrigin nevertheless seem to scale linearly in spite of the poor performance. Varying the population size as suggested in the cited text may help to improve performance by leveraging the advantage inherent in a larger population. This technique, as well as applying the cooperative coevolution approach to scaling, is a potential area for future study.
Bibliography
[1] Abhishek Verma, Xavier Llora, David E Goldberg and Roy H Campbell. Scaling genetic algorithms using MapReduce. Intelligent Systems Design and Applications, 2009.
[2] Asim Munawar, Mohammed Wahib, Masaharu Munetomo and Kiyoshi Akama. A survey: Genetic algorithms and the fast evolving world of parallel computing. IEEE International Conference on High Performance Computing and Communications, 2008.
[3] Thomas Back and Hans-Paul Schwefel. An overview of evolutionary algorithms for parameter optimisation. Evolutionary Computation, 1993.
[4] Edwin K P Chong and Stanislaw H Zak. An Introduction to Optimisation. Wiley Interscience, 2008.
[5] K Tang, X Yao, P N Suganthan, C MacNish, Y P Chen, C M Chen and Z Yang. Benchmark functions for the CEC 2008 special session and competition on large scale global optimisation. IEEE World Congress on Computational Intelligence, 2008.
[6] Maurice Clerc and James Kennedy. The particle swarm: explosion, stability and convergence in a multidimensional complex space. IEEE Transactions on Evolutionary Computation, 2002.
[7] Carlos A Coello Coello, Gary B Lamont and David A Van Veldhuizen. Evolutionary Algorithms for Solving Multi-Objective Problems. Springer, 2007.
[8] R C Eberhart and Y Shi. Comparing inertia weights and constriction factors in particle swarm optimisation. IEEE, 2000.
[9] Russell Eberhart and James Kennedy. A new optimiser using particle swarm theory. IEEE Micro Machine and Human Science, 1995.
[10] A E Eiben and J E Smith. Introduction to Evolutionary Computing. Springer, 2003.
[11] E S Peer, F van den Bergh and A P Engelbrecht. Using neighborhoods with the guaranteed convergence PSO. IEEE, 2003.
[12] David E Goldberg. Genetic Algorithms in Search, Optimisation and Machine Learning. Addison-Wesley Publishing Company, 1989.
[13] Randy L Haupt and Sue Ellen Haupt. Practical Genetic Algorithms. Wiley Publishers, 2004.
[14] Mauro Sebastian Innocente and Johann Sienz. Particle swarm optimisation with inertia weight and constriction factor. International Conference on Swarm Intelligence, 2011.
[15] James Kennedy. Small worlds and mega-minds: Effects of neighborhood topology on particle swarm performance. IEEE, 1999.
[16] James Kennedy. Bare bones particle swarms. IEEE, 2003.
[17] James Kennedy and Russell C Eberhart. Swarm Intelligence. Morgan Kaufmann Publishers, 2001.
[18] James Kennedy and Russell Eberhart. Particle swarm optimisation. IEEE, 1995.
[19] James Kennedy and Russell Eberhart. A discrete binary version of the particle swarm algorithm. IEEE, 1997.
[20] James Kennedy and Rui Mendes. Population structure and particle swarm performance. IEEE, 2002.
[21] James Kennedy and Rui Mendes. Neighborhood topologies in fully informed and best-of-neighborhood particle swarms. Proceedings of the 2003 IEEE International Workshop, 2003.
[22] Mojtaba Ahmadieh Khanesar, Mohammad Teshnehlab and Mahdi Aliyari Shoorehdeli. A novel binary particle swarm. IEEE Proceedings, 2007.
[23] Kumara Sastry, David Goldberg and Graham Kendall. Genetic algorithms, chapter 4. IEEE, 2000.
[24] Yong Liu and Xin Yao. Evolutionary programming made faster. IEEE Transactions on Evolutionary Computation, 1999.
[25] Rui Mendes and James Kennedy. Neighborhood topologies in fully informed and best-of-neighborhood particle swarms. IEEE Transactions on Evolutionary Computation, 2004.
[26] H Muhlenbein. Parallel genetic algorithms, population genetics and combinatorial optimisation. Proceedings of the Third International Conference on Genetic Algorithms, 1989.
[27] S Piccand, M O'Neill and J Walker. On the scalability of particle swarm optimisation. IEEE Congress on Computational Intelligence, 2008.
[28] Ender Ozcan and Chilukuri K Mohan. Analysis of a simple particle swarm optimization system. Intelligent Engineering Systems Through Artificial Neural Networks, 1998.
[29] Erick Cantu-Paz. A survey of parallel genetic algorithms. Illinois Genetic Algorithms Laboratory, 2000.
[30] Mitchell A Potter and Kenneth A De Jong. A cooperative coevolutionary approach to function optimisation. Third Parallel Problem Solving From Nature, 1994.
[31] Wilson Rivera. Scalable parallel genetic algorithms. Kluwer Academic Publishers, 2000.
[32] A Auger, N Hansen, N Mauny, R Ros and M Schoenauer. Bio-inspired continuous optimisation: The coming of age. CEC 2007, Singapore, 2007.
[33] Rui Mendes, James Kennedy and Jose Neves. The fully informed particle swarm: Simpler, maybe better. IEEE Transactions on Evolutionary Computation, 2004.
[34] S Janson and M Middendorf. On trajectories of particles in PSO. Swarm Intelligence Symposium, 2007.
[35] Javad Sadri and Ching Y Suen. A genetic binary particle swarm optimisation model. IEEE Congress on Evolutionary Computation, 2006.
[36] Yuhui Shi and Russell Eberhart. A modified particle swarm optimiser. IEEE, 1998.
[37] F van den Bergh and A P Engelbrecht. A study of particle swarm optimisation trajectories. Journal of Information Sciences, 2005.
[38] R Brits, A P Engelbrecht and F van den Bergh. Scalability of niche PSO. IEEE Swarm Intelligence Symposium, 2003.
[39] Yong Liu, Qiangfu Zhao, Xin Yao and Tetsuya Higuchi. Scaling up fast evolutionary programming with cooperative coevolution. Proceedings of the 2001 Congress on Evolutionary Computation, 2001.
List of Figures
1 Shifted-Sphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2 Shifted-Rastrigin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
List of Tables
1 F1 - Function Evaluations Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2 F4 - Best Final Value Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 F4 - Function Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
APPENDIX A
function l_best_async(func_name, dimension)
% Lbest PSO with constriction factor and asynchronous update.
global dim;
global f;
global fp;
global pos;
global l;
global vel;
global ac1;
global ac2;
global ac11;
global ac12;
global vr;
global popsz;
global neighd_size;
global FE_counter;
global chi;
global int_trials;
global max_FEs;
format short e;
dim = dimension;
max_FEs = 5000*dim;
f_bias = -330;                          % shift bias of the objective function
ergrd = 1e-8;                           % error value (convergence threshold)
int_trials = 0;
true_global_minimum = 0;
global_minimum = true_global_minimum + f_bias;
% build a timestamp that is safe to use inside file names
DateString = datestr(now);
DateString_dir = DateString;
DateString(12) = '-';
DateString([15, 18]) = '.';
DateString(3) = [];
DateString(7) = [];
DateString(7) = [];
DateString_dir(12) = '-';
DateString_dir([15, 18]) = '.';
DateString_dir([3, 8, 9]) = [];
num_trials = 52;
neighd_size = 2;                        % ring neighborhood size (also set in Init_lbest)
if mod(neighd_size, 2) == 0
    size_to_left = neighd_size/2;
    size_to_right = size_to_left;
else
    size_to_left = floor(neighd_size/2);
    size_to_right = size_to_left + 1;
end
while int_trials < num_trials
    int_trials = int_trials + 1;        % incremented at the start of every run
    [x_pos, vel_x] = Init_lbest(func_name);
    xpos = x_pos;
    vel = vel_x;
    pos = xpos;
    l = pos;
    fp = ones(popsz, 1)*Inf;
    f = zeros(popsz, 1);
    f = benchmark_func(xpos, 4);        % evaluate the whole swarm once
    FE_counter = popsz;
    fg = min(f);
    posmaskmin = repmat(vr(1:dim, 1)', popsz, 1);
    posmaskmax = repmat(vr(1:dim, 2)', popsz, 1);
    while FE_counter < max_FEs
        hxpos = zeros(popsz, dim);
        minpos_throwaway = xpos <= posmaskmin;
        maxpos_throwaway = xpos >= posmaskmax;
        for internal_i = 1:popsz
            % evaluate only particles that lie inside the search space
            if sum(minpos_throwaway(internal_i, :)) < 1 && sum(maxpos_throwaway(internal_i, :)) < 1
                hxpos(internal_i, :) = xpos(internal_i, :);
                FE_counter = FE_counter + 1;
                f(internal_i) = benchmark_func(hxpos(internal_i, :), 4);
            end
            if f(internal_i) <= fp(internal_i)      % update the personal best
                fp(internal_i) = f(internal_i);
                pos(internal_i, :) = xpos(internal_i, :);
            end
            % find the best position within the ring neighborhood
            l(internal_i, :) = pos(internal_i, :);
            f_neighd_best = fp(internal_i);
            for neighd_index_unadjusted = (internal_i - size_to_left):(internal_i + size_to_right)
                neighd_index_adjusted = mod(neighd_index_unadjusted + popsz, popsz);
                if neighd_index_adjusted == 0
                    neighd_index_adjusted = popsz;
                end
                if fp(neighd_index_adjusted) < f_neighd_best
                    l(internal_i, 1:dim) = pos(neighd_index_adjusted, 1:dim);
                    f_neighd_best = fp(neighd_index_adjusted);
                end
            end
            errchk = fg - global_minimum;
            fg = min(f);
            if errchk <= ergrd
                disp('reached error gradient');
                break;
            end
            % constricted velocity and (asynchronous) position update
            ac11 = ac1.*rand(popsz, dim);
            ac12 = ac2.*rand(popsz, dim);
            vel = chi*(vel + ac11.*(pos - xpos) + ac12.*(l - xpos));
            xpos = xpos + vel;
        end
        if errchk <= ergrd
            f_file = fullfile('C:', 'Users', 'kingsley', 'Documents', 'MATLAB', 'lbest_psoII', 'data_files', ...
                'test_data', 'Lbest');
            save([f_file, '\lbest', DateString, ',', 'F1', ',', 'clerc-kennedy', ',', ...
                'async', ',', num2str(dim), ',', 'FE_Counter', num2str(FE_counter), ',', 'fg', num2str(fg), ',', ...
                'error', num2str(errchk), ',', ...
                'Trial ', num2str(int_trials), '.mat'], 'FE_counter', 'fg', 'errchk');
            FE_counter = 0;
            break;
        elseif FE_counter >= max_FEs
            f_file = fullfile('C:', 'Users', 'kingsley', 'Documents', 'MATLAB', 'lbest_psoII', 'data_files', ...
                'test_data', 'Lbest');
            save([f_file, '\lbest', DateString, ',', 'F1', ',', 'clerc-kennedy', ',', ...
                'async', ',', num2str(dim), ',', 'FE_Counter', num2str(FE_counter), ',', 'fg', num2str(fg), ',', ...
                'error', num2str(errchk), ...
                'Trial ', num2str(int_trials), '.mat'], 'FE_counter', 'fg', 'errchk');
            FE_counter = 0;
            break;
        else
            continue;
        end
    end
end
end
APPENDIX B
function [xpos, vel] = Init_lbest(f_name)
% Initialise particle positions and velocities for the lbest PSO.
global dim;
global popsz;
global ac1;
global ac2;
global ac11;
global ac12;
global rnd1;
global rnd2;
global neighd_size;
global mxvel;
global vr;
global chi;
global phi;
global int_trials;
neighd_size = 2;
popsz = 50;
vmax_perc = .5;
% reseed the generator so that each trial is reproducible
Rand_seq_start_point = int_trials*104729;
rand('twister', int_trials*Rand_seq_start_point);
ac1 = 2.05;                             % acceleration constants, phi1 = phi2 = 2.05
ac2 = 2.05;
rnd1 = rand(popsz, dim);
rnd2 = rand(popsz, dim);
% search space center and range per test problem
if strcmpi(f_name, 'sphere_shift_func')
    init_range = 2*100;
    init_center = 0;
    vmin = ones(dim, 1)*-100;
    vmax = ones(dim, 1)*100;
    vr = [vmin, vmax];
elseif strcmpi(f_name, 'rastrigin_shift_func')
    init_range = 2*5;
    init_center = 0;
    vmin = ones(dim, 1)*-5;
    vmax = ones(dim, 1)*5;
    vr = [vmin, vmax];
elseif strcmpi(f_name, 'ackley_shift_func')
    init_range = 2*32;
    init_center = 0;
    vmin = ones(dim, 1)*-32;
    vmax = ones(dim, 1)*32;
    vr = [vmin, vmax];
end
center_IS = repmat(init_center.*ones(1, dim), popsz, 1);
range_IS = repmat(init_range.*ones(1, dim), popsz, 1);
mxvel = 4;
% random positions within the search space and random initial velocities
xpos(1:popsz, 1:dim) = center_IS + range_IS.*rand(popsz, dim) - range_IS./2;
vel(1:popsz, 1:dim) = normmat(rand([popsz, dim]), [forcecol(-mxvel), forcecol(mxvel)]);
kappa = 1;
ac11 = ac1.*rnd1;
ac12 = ac2.*rnd2;
phi = ac1 + ac2;                                  % phi = 4.1 > 4
chi = 2*kappa/(phi - 2 + sqrt(phi^2 - 4*phi));    % constriction factor, equation 63
end
APPENDIX C
function [out, varargout] = normmat(x, newminmax)
% Rescale the columns of x to fit within the range given by newminmax.
a = min(x, [], 1);
b = max(x, [], 1);
for i = 1:length(b)
    if abs(a(i)) > abs(b(i))
        large(i) = a(i);
        small(i) = b(i);
    else
        large(i) = b(i);
        small(i) = a(i);
    end
end
den = abs(large - small);
temp = size(newminmax);
if temp(1)*temp(2) == 2
    % a single [min max] pair applies to every column
    newminmaxA(1, :) = newminmax(1).*ones(size(x(1, :)));
    newminmaxA(2, :) = newminmax(2).*ones(size(x(1, :)));
elseif temp(1) > 2
    error('Error: for method=1, range matrix must have 2 rows and same columns as input matrix');
else
    newminmaxA = newminmax;
end
range = newminmaxA(2, :) - newminmaxA(1, :);
for j = 1:length(x(:, 1))
    for i = 1:length(b)
        if den(i) == 0
            out(j, i) = x(j, i);
        else
            z21(j, i) = (x(j, i) - a(i))./(den(i));          % normalise to [0, 1]
            out(j, i) = z21(j, i).*range(1, i) + newminmaxA(1, i);
        end
    end
end
varargout{1} = a;
varargout{2} = b;
return
APPENDIX D
function save_trial_data(dt_dir, int_trial)
% Save the number of function evaluations and best final value of a trial.
global dim;
global FE_counter;
global fg_final_per_trial;
disp('in save trial data');
DateString = datestr(now);
DateString(12) = '-';
DateString([15, 18]) = '.';
DateString(3) = [];
DateString(7) = [];
DateString(7) = [];
f_file = fullfile('C:', 'Users', 'kingsley', 'Documents', 'MATLAB', 'lbest_psoII', 'data_files', ...
    'test_data');
save([f_file, '\Lbest', '\', dt_dir, '\PSO', '\lbest', ...
    DateString, ',', 'F1', ',', 'clerc-kennedy', ',', 'async', ',', num2str(dim), ',', 'Trial ', ...
    num2str(int_trial), '.mat'], 'fg_final_per_trial', 'FE_counter');
end
APPENDIX E
function [out] = forcecol(in)
% Force any vector or matrix into a single column vector.
len = numel(in);
out = reshape(in, [len, 1]);