
Genetic Algorithm and its Application in Data Mining

Genetic Algorithms

Many real-world optimization problems have no known polynomial-time algorithm, which
makes them hard to solve. A number of heuristics have been designed for these hard
problems. They may provide a sub-optimal but acceptable solution in reasonable
computational time. A number of meta-heuristics derived from natural physical and
biological phenomena, such as simulated annealing, evolutionary algorithms and
artificial neural networks, have also been used to solve these problems.

Genetic Algorithms (GAs) are adaptive procedures derived from Darwin's principle of
survival of the fittest in natural genetics. A GA maintains a population of potential
solutions to the candidate problem, termed individuals. By manipulating these
individuals through genetic operators such as selection, crossover and mutation, the GA
evolves towards better solutions over a number of generations. The implementation of a
genetic algorithm is shown as a flowchart in Figure-1.

[Figure-1: Flowchart of a genetic algorithm. Boxes, in order: create initial population
of chromosomes / individuals; evaluate fitness of individuals; select the individuals;
apply genetic operators (crossover and mutation); decision: finished all generations in
the genetic algorithm / stopping criteria? The NO branch loops back for another
generation.]

Genetic algorithms start with a randomly created initial population of individuals,
which involves encoding every variable. A string of variables makes a chromosome or
individual. When genetic algorithms were first implemented in the early seventies, they
were applied to continuous optimization problems with binary coding of variables. In
numerical problems the binary variables are mapped to real numbers. Later, GA was used
to solve many combinatorial optimization problems such as the 0/1 knapsack problem, the
travelling salesperson problem, scheduling problems, etc. Binary coding has not been
found suitable for many of these problems, so codings other than binary have also been
used. Continuous function optimization uses real-number coding. Problems such as the
travelling salesperson problem and graph coloring use permutation coding. Genetic
programming applications use tree coding.
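
The mapping from a binary-coded variable to a real number can be sketched as follows.
This is a minimal illustration rather than a prescribed scheme: the function name
decode_binary and the linear scaling onto an interval [lower, upper] are assumptions
made for the example.

```python
def decode_binary(bits, lower, upper):
    """Map a bit string such as '1000' to a real number in [lower, upper].

    Assumed scheme: read the bits as an unsigned integer and scale it
    linearly so that all zeros gives `lower` and all ones gives `upper`.
    """
    as_int = int(bits, 2)              # integer value of the bit string
    max_int = 2 ** len(bits) - 1       # largest value the string can hold
    return lower + (upper - lower) * as_int / max_int
```

With 4 bits on the interval [0, 15], the string '1000' decodes to 8.0. Longer bit
strings give a finer resolution over the same interval.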

A GA uses a fitness function, derived from the objective function of the optimization
problem, to evaluate the individuals in a population. The fitness function is the
measure of an individual's fitness, which is used to select individuals for
reproduction. Many real-world problems may not have a well-defined objective function
and require the user to define a fitness function.

The selection method in a GA selects parents from the population on the basis of the
fitness of individuals. Individuals with high fitness are selected with higher
probability to produce offspring for the next population. Selection methods assign a
probability P(x) to each individual in the population at the current generation, which
is proportional to the fitness of individual x relative to the rest of the population.

Fitness-proportionate selection is the most commonly used selection method. Given $f_i$
as the fitness of the ith individual, P(x) in fitness-proportionate selection is
calculated as:

$$P(x) = \frac{f_x}{\sum_i f_i}$$

After the expected values P(x) are calculated, the individuals are selected using
roulette wheel sampling in the following steps.

Let C be the sum of expected values of individuals in a population.
Repeat two or more times to select the parents for mating.
   i. Choose a uniform random number r in the interval [0, C].
   ii. Loop through the individuals in the population, summing the expected
       values, until the sum is greater than or equal to r. The individual
       at which the sum crosses this limit is selected.
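
The roulette wheel steps above can be sketched in a few lines of Python. This is an
illustrative sketch, assuming non-negative fitness values with a positive total; the
function name roulette_select and the optional rng argument are introduced only for the
example. It draws r from the half-open range [0, C) and uses a strict comparison so
that an individual with zero fitness is never selected.

```python
import random

def roulette_select(fitnesses, rng=random):
    """Return the index of one individual, chosen with probability f_x / sum(f_i)."""
    total = sum(fitnesses)             # C: total size of the wheel
    r = rng.random() * total           # spin: uniform number in [0, C)
    running = 0.0
    for index, fitness in enumerate(fitnesses):
        running += fitness             # accumulate slices of the wheel
        if running > r:                # the running sum has crossed r
            return index
    return len(fitnesses) - 1          # fallback, only reachable through round-off
```

Calling roulette_select twice yields the indices of the two parents for one mating.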

Fitness-proportionate selection is strongly biased towards the fit individuals in the
population and exerts high selection pressure. It causes premature convergence of the
GA: after a few generations the population is made up of highly fit individuals, and no
fitness differences remain for the selection procedure to work on. Therefore, other
selection methods such as tournament selection and rank selection are used to avoid
this bias. Tournament selection compares two or more randomly selected individuals and
selects the better individual with a pre-specified probability. Rank selection
calculates the probability of selection of individuals on the basis of their ranking
according to increasing fitness values in a population.
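
Both alternatives can be sketched briefly. The code below is a hedged illustration, not
a reference implementation: the names tournament_select and rank_probabilities, the
default values k=2 and p_best=0.9, and the linear form of the ranking are assumptions
made for the example (other ranking schemes exist).

```python
import random

def tournament_select(fitnesses, k=2, p_best=0.9, rng=random):
    """Pick k distinct random individuals; return the fittest with probability p_best.

    Assumes k >= 2 so that a non-best contender exists when the best is rejected.
    """
    contenders = rng.sample(range(len(fitnesses)), k)
    contenders.sort(key=lambda i: fitnesses[i], reverse=True)
    if rng.random() < p_best:
        return contenders[0]               # the better individual wins
    return rng.choice(contenders[1:])      # otherwise one of the others

def rank_probabilities(fitnesses):
    """Linear ranking: selection probability proportional to rank (1 = least fit)."""
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    n = len(fitnesses)
    total = n * (n + 1) / 2                # sum of the ranks 1..n
    probs = [0.0] * n
    for rank, i in enumerate(order, start=1):
        probs[i] = rank / total
    return probs
```

Because rank selection depends only on the ordering, one very fit individual cannot
dominate the wheel the way it can under fitness-proportionate selection.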

Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets"

In a standard genetic algorithm, two parents are selected at a time and are used to
create two new children that take part in the next generation. The selected pair is
subjected to the crossover operator with a pre-specified probability of crossover.
Single-point crossover is the most common form of this operator. It marks a random
crossover spot within the length of the chromosome and exchanges the bits (in binary
coding) to the right of the spot, as shown below.

Parent 1: 01010101 | 0100        Child 1: 01010101 | 1101
Parent 2: 01110101 | 1101        Child 2: 01110101 | 0100
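
Single-point crossover on bit strings can be sketched as below. The function name
single_point_crossover and the optional point argument (which fixes the crossover spot
instead of drawing it at random) are assumptions introduced for illustration.

```python
import random

def single_point_crossover(parent1, parent2, point=None, rng=random):
    """Swap the tails of two equal-length strings to the right of a crossover spot."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    if point is None:
        # Random spot strictly inside the chromosome, so both parts are non-empty.
        point = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2
```

With the two 12-bit parents of the example above and the spot after the eighth bit,
single_point_crossover("010101010100", "011101011101", point=8) returns
("010101011101", "011101010100").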

The mutation operator is applied to all the children after crossover. It flips each bit
in the individual with a pre-specified probability of mutation. An example is given
below in which the fifth bit has been mutated.

Before mutation: 0 1 1 1 0 1 0 1 0 1 1 1
After mutation:  0 1 1 1 1 1 0 1 0 1 1 1
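
Bit-flip mutation can be sketched as follows. The function name mutate and the default
probability are illustrative assumptions; every bit is tested independently against the
mutation probability.

```python
import random

def mutate(bits, p_mutation=0.01, rng=random):
    """Flip each bit of a '0'/'1' string independently with probability p_mutation."""
    result = []
    for bit in bits:
        if rng.random() < p_mutation:
            result.append("1" if bit == "0" else "0")   # flip this bit
        else:
            result.append(bit)                          # keep it unchanged
    return "".join(result)
```

With a small mutation probability most children pass through unchanged, and only an
occasional bit is flipped as in the example above.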

The procedure is repeated until the new population contains the required number of
individuals, which completes one generation of the genetic algorithm. The GA is run
until a stopping criterion is satisfied; this may be defined in many ways. A
pre-specified number of generations is the most widely used criterion. Other criteria
include the desired quality of solution, the number of generations without any
improvement in the results, etc.

A standard genetic algorithm utilizes three genetic operators: reproduction (selection),
crossover and mutation. Elitism is used in genetic algorithms to ensure that the best
individual in a population is passed on, unperturbed by the genetic operators, to the
population of the next generation.
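
Putting the pieces together, one possible generational loop with elitism looks like the
sketch below. It is an illustration under stated assumptions, not a definitive
implementation: it uses binary tournament selection rather than the roulette wheel, a
fixed number of generations as the stopping criterion, lists of 0/1 integers as
chromosomes, and a hypothetical function name run_ga with arbitrary default parameters.

```python
import random

def run_ga(fitness, n_bits=20, pop_size=30, p_cross=0.9, p_mut=0.01,
           generations=50, seed=0):
    """Maximise `fitness` over bit lists with a simple generational GA."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        elite = max(pop, key=fitness)
        next_pop = [elite[:]]                      # elitism: copy the best unchanged
        while len(next_pop) < pop_size:
            # Selection: the fitter of two random individuals, done twice.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            if rng.random() < p_cross:             # single-point crossover
                cut = rng.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):                 # bit-flip mutation
                child = [b ^ 1 if rng.random() < p_mut else b for b in child]
                if len(next_pop) < pop_size:
                    next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```

As a toy usage example, run_ga(sum) searches for the bit list with the largest number
of ones. Because the elite individual is copied unchanged, the best fitness in the
population never decreases from one generation to the next.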

The values of genetic parameters such as population size, crossover probability,
mutation probability and total number of generations affect the convergence properties
of genetic algorithms. These values are generally decided before the start of GA
execution on the basis of previous experience. Experimental studies recommend a
population size of 20 to 30, a crossover probability between 0.75 and 0.95, and a
mutation probability between 0.005 and 0.01. The parameters may also be fixed by tuning
in trial GA runs before the actual run of the GA. Deterministic control and adaptation
of the parameter values to a particular application have also been used to determine
values of genetic parameters. In deterministic control, the value of a genetic
parameter is altered by some deterministic rule during the GA run. Adaptation of
parameters allows their values to change during the GA run on the basis of performance
in previous generations. In self-adaptation, the operator settings are encoded into
each individual in the population, so the parameter values evolve during the GA run.
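
As one hypothetical instance of deterministic control, the mutation probability can be
lowered linearly as the run progresses. The rule, the function name
linear_mutation_schedule and the end points (taken from the recommended range
0.005 to 0.01) are assumptions chosen only to illustrate the idea.

```python
def linear_mutation_schedule(generation, total_generations,
                             p_start=0.01, p_end=0.005):
    """Mutation probability that moves linearly from p_start to p_end over the run."""
    if total_generations <= 1:
        return p_end
    fraction = generation / (total_generations - 1)   # 0.0 at start, 1.0 at end
    return p_start + (p_end - p_start) * fraction
```

The rule depends only on the generation counter, not on how the population is
performing, which is what distinguishes deterministic control from adaptation.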

Applications in Data Mining

Data mining is used to analyze large datasets and establish useful classifications and
patterns in them. Agricultural and biological research studies have used various
techniques of data mining including natural trees, statistical machine learning and
other analysis methods. Genetic algorithms have been widely used in data mining
applications such as classification, clustering, feature selection, etc. Two
applications of GA in data mining are described below.


The effectiveness of three classification algorithms (genetic algorithm, fuzzy
classification and fuzzy clustering) was compared and analyzed on collected supervised
and unsupervised soil data (Bhargavi and Jyothi 2011). Soil classification deals with
the categorization of soils based on distinguishing characteristics as well as criteria
that dictate choices in use. Soil classification is a dynamic subject, from the
structure of the system to the definitions of classes, and finally in the application
in the field. GATree and fuzzy classification rules were used for supervised learning.
Classification based on fuzzy rules gave better performance than GATree. For
unsupervised learning, the Fuzzy C-Means algorithm was used to classify the soil data.
This helps to classify soil texture effectively on the basis of soil properties; soil
texture influences fertility, drainage, water holding capacity, aeration, tillage, and
bearing strength of soils.

Shah and Kusiak (2004) applied a genetic algorithm to feature selection for mining SNPs
in association studies. Genomic studies provide large volumes of data with thousands of
single nucleotide polymorphisms (SNPs). The analysis of SNPs determines relationships
between genotypic and phenotypic information and helps in the identification of SNPs
related to a disease. The authors developed an approach for predicting drug
effectiveness that is based on data mining and genetic algorithms. A global search
mechanism, a weighted decision tree, a decision-tree-based wrapper, a correlation-based
heuristic, and the identification of intersecting feature sets with a genetic algorithm
are employed for selecting significant genes. The feature selection approach resulted
in an 85% reduction in the number of features. The relative increases in
cross-validation accuracy and specificity for the selected significant gene/SNP set
were 10% and 3.2%, respectively. The feature selection approach was successfully
applied to data sets for drug and placebo subjects. The number of features was
significantly reduced while the quality of knowledge was enhanced.
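
In GA-based feature selection, a common encoding makes each individual a bit mask over
the features, with a fitness that rewards classification accuracy on the selected
features. The sketch below illustrates that general wrapper idea only. It is not the
method of Shah and Kusiak: the leave-one-out 1-nearest-neighbour classifier, the
penalty on the number of features, and all function names are assumptions chosen to
keep the example self-contained.

```python
def loo_1nn_accuracy(rows, labels, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule on the masked features."""
    selected = [j for j, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0                      # an empty feature set cannot classify
    correct = 0
    for i, row in enumerate(rows):
        # Nearest other row, by squared Euclidean distance on selected features.
        nearest = min(
            (k for k in range(len(rows)) if k != i),
            key=lambda k: sum((row[j] - rows[k][j]) ** 2 for j in selected),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def feature_mask_fitness(rows, labels, mask, penalty=0.01):
    """Fitness of a feature mask: accuracy minus a small cost per selected feature."""
    return loo_1nn_accuracy(rows, labels, mask) - penalty * sum(mask)
```

A GA would evolve the masks with the usual selection, crossover and mutation operators
and use feature_mask_fitness to rank them, so that small, accurate feature sets
survive.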

References

P. Bhargavi, S. Jyothi (2011) Soil classification using data mining techniques: a
comparative study. International Journal of Engineering Trends and Technology,
July to August Issue, 2011.
D. E. Goldberg (1989) Genetic algorithms in search, optimization and machine
learning. Addison Wesley.
M. Mitchell (1996) An Introduction to Genetic Algorithms. MIT Press, MA.
S. C. Shah, A. Kusiak (2004) Data mining and genetic algorithm based gene/SNP
selection. Artificial Intelligence in Medicine, 31, 183-196.
