
Feature selection in high-dimensional EEG data by parallel multi-objective optimization

Dragi Kimovski1, Julio Ortega2, Andrés Ortiz3, Raúl Baños4

1 University of Information Science & Technology, Ohrid, Macedonia
2 Dept. Computer Architecture and Technology, CITIC, University of Granada, Spain
3 Dept. Ingeniería de Comunicaciones, University of Málaga, Spain
4 Dept. Administración y Dirección de Empresas, Universidad Católica de Murcia, Spain
e-mail: dragi.kimovski@uist.edu.mk, jortega@ugr.es, aortiz@ic.uma.es, rbanos@ucam.edu

Abstract: Feature selection is required in many applications that involve high-dimensional model building or classification problems, and many bioinformatics applications belong to this type. Recently, several approaches have been proposed that formulate supervised and unsupervised feature selection as a multi-objective optimization problem. As the performance of unsupervised classification is evaluated through the quality of the groups or clusters obtained in the data set to be classified, it is difficult to define a single objective function that drives the selection of the features; combining several evaluation measures, and thus a multi-objective characterization of the clustering, can therefore provide a suitable set of features for unsupervised classification. In this paper, we consider the parallel implementation of multi-objective feature selection, which makes it possible to tackle complex classification problems with many features to select from, and specifically high-dimensional data sets with many more features than data items. We propose master-worker implementations of two different parallel evolutionary models: the parallel computation of the cost functions for the individuals in the population, and the parallel execution of evolutionary multi-objective procedures on subpopulations. The experiments carried out on different benchmarks, including some related to feature selection in the classification of EEG (Electroencephalogram) signals for BCI (Brain Computer Interface) applications, show the benefits of parallel processing not only for decreasing the running time, but also for improving the solution quality.

Keywords: Feature selection, Multi-objective clustering, Parallel multi-objective optimization, Wrapper methods.

I. INTRODUCTION
Dimensionality reduction is an important issue in classification problems, as it can reduce the computation cost associated with training on a large number of features that could be redundant, noisy or irrelevant, and it can avoid the problems related to data sets having more features than patterns (curse of dimensionality). Such problems are frequent in bioinformatics applications, as described in [1], which provides a review of feature selection techniques used in bioinformatics along with analyses and references of feature selection in different bioinformatics applications such as sequence analysis, microarray analysis, and mass spectra analysis. In [2], feature selection in high-dimensional feature spaces with small pattern


samples is considered in the search for sets of genes whose expression levels serve as feature sets for diagnosis or
prognosis. Dimension reduction in the input patterns has been
also applied to EEG classification for recognizing epileptiform
patterns [3]. Moreover, in EEG classification [4], dealing with issues such as (1) the presence of noise or outliers in the features (as EEG signals have a low signal-to-noise ratio); (2) the need to represent time information in the features, as the brain patterns are usually related to changes in time in the EEG signals; (3) the non-stationarity of EEG signals, which may change quickly over time or across experiments; and (4) the low number of patterns (EEGs) available for training, as the experimental work required to register the EEGs for different events is time consuming, usually implies increasing the dimensionality of the feature vectors. Thus, classification of EEG signals, for example in BCI applications, has to be accomplished from relatively few feature vectors of very high dimensionality. This circumstance leads to the so-called curse-of-dimensionality problem, as the number of patterns needed to properly define the different classes increases very fast with the dimension of the feature vectors (from five to ten times as many training samples per class as the dimension [5]).
The approaches to dimensionality reduction can be
classified into two alternatives, the feature space
transformation through linear or non-linear transforms (such
as principal component analysis), and the selection of a subset
of features. This paper deals with feature selection in order (1) to decrease the computational complexity of the procedure, (2) to remove irrelevant/redundant features that would make the learning of the classifier more difficult, or (3) to avoid the curse of dimensionality in problems with many features and a low number of available data to be classified [6].
The feature selection problem can be defined as the search for the feature set that optimizes a cost function evaluating the utility of the selected features for a given clustering problem, taking into account the classification results obtained with this selection. This is the wrapper approach considered in this paper (the alternative to wrapper methods for feature selection are the filter approaches, which propose representative utility measures for the features, such as the correlation or the mutual information between features).
In wrapper approaches, the evaluation of the utility of a given set of features presents different issues depending on whether the clustering procedure used is supervised or unsupervised. If the procedure is supervised, it is relatively easy to define the utility cost function by using the classification error. Nevertheless, in unsupervised procedures the utility should be determined from a definition of clustering quality without knowledge of the corresponding labels or even of the number of clusters. Frequently, the quality measures for clustering use ratios between intra-cluster compactness and inter-cluster separation. However, the distances between points tend towards similar values as the dimension grows, so these quality measures are biased towards lower-dimensional solutions [6].

Thus, although the formulation of the feature selection problem as a multi-objective optimization problem could provide some advantages, as indicated in [6], these advantages depend on whether the classification procedure is supervised or unsupervised. In supervised classification procedures, the goal is usually to maximize the classifier performance while minimizing the number of features, as larger feature sets could produce overfitting and poor generalization. A multi-objective optimization approach that takes into account both the classifier performance and the number of features therefore allows an adequate formulation of this goal.

The situation in unsupervised classification problems is different. In this case it is difficult to evaluate the clustering and, as previously indicated, the applied validation techniques usually present a dimensionality bias towards either smaller or larger cardinality feature sets. Thus, a multi-objective approach could counterbalance the specific bias of the considered cluster validation method.

Several works have formulated feature selection as a multi-objective optimization problem, either for supervised or unsupervised classifiers. A very good review of the alternatives and previous references on this topic is the paper by J. Handl and J. Knowles [6]. With respect to supervised classifiers, [7, 8] provide multi-objective feature selection procedures that take into account the number of features and the performance of the classifier. For unsupervised classification we have the papers [6, 9, 10]. In [9], given a feature selection, the k-means algorithm is used to build a clustering that is evaluated through four objectives (number of features, number of clusters, compactness of the clusters, and separation between clusters). The paper [10] also uses k-means for clustering, together with the number of features and the Davies-Bouldin Index (DBI) [11]. Along with a critical review of [9] and [10] and an experimental study of different alternatives for unsupervised feature selection with multi-objective optimization, [6] provides a strategy to select (without external knowledge) the most adequate solution from the obtained Pareto front approximation.

After this introduction to the issues addressed in the paper and the previous works in the area, Section II describes the approaches considered here for the parallel implementation of evolutionary multi-objective optimization procedures based on NSGA-II [25], while Section III proposes some parallel multi-objective alternatives for feature selection. Finally, Section IV provides and analyses the experimental results, and Section V gives the conclusions of the paper.

II. PARALLEL EVOLUTIONARY MULTI-OBJECTIVE OPTIMIZATION
A multi-objective optimization problem can be defined as the problem of finding a vector of decision variables x ∈ R^n, x = [x1, x2, ..., xn], that satisfies a set of constraints, g(x) ≤ 0, h(x) = 0, and optimizes a vector function f(x) whose scalar components (f1(x), f2(x), ..., fm(x)) represent the objectives to optimize. These objectives are usually in conflict, and the concept of optimum must be redefined in this context. Thus, instead of providing only one optimal solution, the procedures applied to these multi-objective optimization problems should obtain a set of non-dominated solutions, known as Pareto optimal solutions, from which a decision agent will choose the most convenient solution in the current circumstances. These solutions are optimal in the sense that, in the corresponding hyper-surface known as the Pareto front, no solution is worse than any other one when all the objectives are taken into account; that is, they are non-dominated solutions.
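As an illustration of this concept, a minimal Python sketch of the dominance test and of the extraction of the non-dominated set (assuming all objectives are to be minimized; the function names are ours, not part of the procedure described in the paper) is:

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized):
    a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(front):
    """Return the objective vectors of `front` that are not dominated by any other one."""
    return [a for a in front if not any(dominates(b, a) for b in front if b is not a)]

# Example with two objectives to minimize (e.g., a clustering cost and the number of features)
points = [(0.30, 12), (0.25, 20), (0.40, 5), (0.25, 12)]
print(non_dominated(points))   # [(0.40, 5), (0.25, 12)]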
TABLE I. PARALLEL ALTERNATIVES FOR EVOLUTIONARY ALGORITHMS

Parallel model: Distributed fitness computation
  Communication issues: Frequency: each generation; Topology: through master; Information: solution fitness
  Implementation: Master-Worker

Parallel model: Concurrent evolutionary algorithms on subpopulations
  Communication issues: Frequency: {periodic, adaptive, probabilistic}; Topology: through master; Information: {solutions, searching memory, ...}
  Implementation: Master-Worker

  Communication issues: Frequency: {periodic, adaptive, probabilistic}; Topology: {unidirectional, bidirectional, mesh, hypercube, ...}; Information: {solutions, searching memory, ...}
  Implementation: Island (Coarse-Grain model)

  Communication issues: Frequency: {periodic, adaptive, probabilistic}; Topology: {unidirectional, bidirectional, mesh, hypercube, ...}; Information: {solutions, searching memory, ...}
  Implementation: Diffusion (Fine-Grain model)

In a multi-objective formulation of the feature selection problem, the possible features define the vector of decision variables. The space of features in many classification problems is usually very large. Moreover, in a wrapper approach for feature selection, the fitness evaluation of an individual of the population (a given set of selected features) requires running the learning procedure that determines the characteristics of the classifier and then evaluating its performance. Both factors (the high number of feature selection alternatives and the time required for the fitness evaluation of the population) lead to high computing times to obtain a good enough approximation to the Pareto front of non-dominated sets of features, so an efficient parallel implementation of the multi-objective optimization procedure would be very useful.

Here, we consider two possible models to parallelize evolutionary algorithms in the context of the data decomposition approach (see Table I): (1) the distribution of the fitness computation; and (2) the concurrent execution of evolutionary algorithms over multiple subpopulations.

These two parallelization alternatives can be implemented with a master-worker approach. In the case of distributed fitness computation, the master process distributes the population and executes the remaining steps of the evolutionary iterations, while the other processors (the workers) only evaluate the fitness of the individuals of the population. Thus, each iteration implies communication between the master and the workers to distribute the population and receive the fitness values. In the second parallelization alternative (concurrent execution of evolutionary algorithms over multiple subpopulations), the master distributes the subpopulations among the workers which, after a given number of independent iterations in parallel, communicate with each other through the master. The parallel execution of evolutionary algorithms on subpopulations can also be implemented by assigning each subpopulation to a different processor that independently performs some iterations of the evolutionary algorithm on its subpopulation (as in the master-worker implementation of this parallel model), but communicates with the other processors to send or receive information according to a strategy that defines the communication frequency and the information to be exchanged. As our goal in this paper is to gain insight into the benefits of parallel multi-objective optimization for feature selection, we do not consider implementation-specific issues here and thus we have compared the two parallel models by using master-worker implementations.
An important issue in the parallel model of concurrent execution of evolutionary algorithms over multiple populations is the distribution of individuals among the different subpopulations. Basically, there are two options: (1) each process uses individuals belonging to the whole search space; and (2) each process explores a different part of the search space. The first option is quite similar to running the sequential algorithm as many times as there are workers, although with fewer individuals in each population and with communication among the workers after some independent iterations. The second option is quite interesting, as it would make the best use of the resources by preventing more than one process from searching the same area. However, it is difficult to develop a working procedure that restricts each process to a specifically limited and independent area, so mixed approaches, where processes focus on some part of the search space although some overlapping may occur, have also been considered. Due to space limitations we do not give more details on these issues. Some papers in this research area are [12-18].
Algorithms 1 and 2 provide pseudo-code for the two parallel models of evolutionary (multi-objective) optimization in Table I. In Algorithm 1, the master process sends subpopulations of individuals to the worker processes in sentences 02-04, for a parallel fitness evaluation of the individuals in the initial population, and in sentences 12-14, inside each iteration of the do-while loop. The detailed descriptions of sentences 10 and 11 depend on the characteristics of the implemented multi-objective evolutionary algorithm (MOEA), which determines the way the individuals are selected for the next iteration (whether only non-dominated ones are selected, whether a solutions archive is used, etc.). The distribution of the individuals among the different subpopulations is not important in this case, as the evaluation of the cost functions is completely independent for each individual. Only load balancing should be taken into account, whenever differences in the computational costs of different individuals are known.
Master process
01  Initialize a Population composed of P subpopulations, SP[i] (i=1,..,P), of N/P individuals
02  for i=1 to P workers
03      Send the i-th subpopulation SP[i] to Worker[i];
04  end;
05  t=1;
06  do
07      for i=1 to P workers
08          Receive subpopulation SP[i] from Worker[i];
09      end;
10      Execute one iteration of MOEA on the population (SP[1] U SP[2] U ... U SP[P])
11      Distribute the population into new subpopulations SP[i] (i=1,..,P) of N/P individuals
12      for i=1 to P workers
13          Send the i-th subpopulation SP[i] to Worker[i];
14      end;
15      t=t+1;
16  while stop criterion is not reached;

Worker[i]
01  while true
02      Receive subpopulation SP[i] from Master process
03      Evaluation of individuals in SP[i]
04      Send subpopulation SP[i] to Master process
05  end;

Algorithm 1. Pseudo-code for the parallel fitness evaluation alternative
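As an illustration of Algorithm 1, the following Python sketch (our own, using the standard multiprocessing module; the MOEA step and the cost functions are only placeholders, not the implementation used in the paper) shows how the master can delegate the fitness evaluation to a pool of worker processes:

import multiprocessing as mp
import random

N_FEATURES = 152       # length of each binary individual (e.g., benchmark b152)
POP_SIZE = 40          # N individuals
GENERATIONS = 10       # stop criterion of this sketch: a fixed number of iterations

def evaluate(individual):
    """Worker task (sentence 03 of Worker[i]): objective vector of one feature selection.
    A dummy pair (random clustering cost, number of selected features) stands in
    for the real SOM-based wrapper evaluation."""
    return (random.random(), sum(individual))

def moea_iteration(population, fitnesses):
    """Placeholder for sentences 10-11 of the Master: one NSGA-II-like iteration
    (selection, crossover, mutation) and redistribution into subpopulations."""
    return population

if __name__ == "__main__":
    population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(POP_SIZE)]
    with mp.Pool(processes=4) as pool:          # the worker processes
        for _ in range(GENERATIONS):
            fitnesses = pool.map(evaluate, population)    # sentences 02-04 / 12-14
            population = moea_iteration(population, fitnesses)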


Master process
01  Initialize a Population composed of P subpopulations, SP[i] (i=1,..,P), of N/P individuals
02  for i=1 to P workers
03      Send the i-th subpopulation SP[i] to Worker[i];
04  end;
05  t=1;
06  do
07      for i=1 to P workers
08          Receive subpopulation SP[i] from Worker[i];
09      end;
10      Combine_&_Distribute (SP[1], SP[2], ..., SP[P]);
11      for i=1 to P workers
12          Send the i-th subpopulation SP[i] to Worker[i];
13      end;
14      t=t+1;
15  while stop criterion is not reached;

Worker[i]
01  while true
02      Receive subpopulation SP[i] from Master process
03      Execute MOEA on SP[i] for genpar iterations
04      Send subpopulation SP[i] to Master process
05  end;

Algorithm 2. Pseudo-code for the concurrent evolution of subpopulations alternative

In sentence 03 of the Worker process in Algorithm 2, each worker independently executes genpar iterations of the implemented MOEA on one of the subpopulations initialized by the master process in sentence 01 of the Master process. Then, inside the do-while loop, the function Combine_&_Distribute is applied by the master in sentence 10 to take advantage of the characteristics of the different subpopulations computed by the workers in their previous genpar independent iterations. It also generates the new subpopulations that will evolve in parallel (sentence 03 of the Worker process). The details of the function Combine_&_Distribute define different alternatives for the cooperation among processors in the search for the best feature selection, and may depend on the application at hand. In the next section, we describe the three alternatives we propose for the feature selection problem.
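A schematic Python version of the concurrent-subpopulation model of Algorithm 2 is sketched below for contrast with the previous sketch; run_moea and combine_and_distribute are our own placeholder names that simply stand in for the operations described above, not the actual implementation:

import multiprocessing as mp
import random

P = 4                 # number of workers / subpopulations
SUBPOP_SIZE = 10      # N/P individuals per subpopulation
N_FEATURES = 152
GENPAR = 20           # independent MOEA iterations per worker between combinations

def run_moea(subpopulation, generations):
    """Placeholder for sentence 03 of Worker[i]: evolve one subpopulation with the
    chosen MOEA for `generations` iterations. Here it only flips one random bit
    per individual so that the sketch runs."""
    for individual in subpopulation:
        j = random.randrange(len(individual))
        individual[j] = 1 - individual[j]
    return subpopulation

def combine_and_distribute(subpopulations):
    """Placeholder for sentence 10 of the Master (Combine_&_Distribute): merge the P
    subpopulations and build the new ones sent back to the workers; ALT1, ALT2 and
    ALT3 differ precisely in how this step is done."""
    merged = [ind for sp in subpopulations for ind in sp]
    random.shuffle(merged)
    return [merged[i::len(subpopulations)] for i in range(len(subpopulations))]

def worker_step(subpopulation):
    return run_moea(subpopulation, GENPAR)

if __name__ == "__main__":
    subpops = [[[random.randint(0, 1) for _ in range(N_FEATURES)]
                for _ in range(SUBPOP_SIZE)] for _ in range(P)]
    with mp.Pool(processes=P) as pool:
        for _ in range(5):                                # stop criterion: fixed rounds
            evolved = pool.map(worker_step, subpops)      # workers evolve in parallel
            subpops = combine_and_distribute(evolved)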

III. PARALLEL MULTIOBJECTIVE FEATURE SELECTION


Figure 1 provides a simplified scheme of a multi-objective optimization procedure for feature selection in a clustering algorithm that, for example, could be a self-organizing map (SOM) [20], as we have considered in our experiments. Each individual of the population encodes the features of the input patterns that are taken into account in the SOM training. The evaluation of each individual (a feature selection) implies training a SOM with the given input patterns and evaluating the performance of the obtained clustering by using several cost functions, as a multi-objective optimization procedure has been considered. These training processes usually require a large amount of computing time, so a parallel approach to this procedure seems fully justified.
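The wrapper evaluation of a single individual can be sketched as follows; the SOM training and the cost functions are passed in as placeholder callables, since the exact procedures used in the experiments are not reproduced here:

import numpy as np

def evaluate_individual(mask, patterns, train_clustering, objectives):
    """Wrapper evaluation of one individual (feature selection).
    mask             : binary vector of length F (1 = feature selected)
    patterns         : array of shape (n_patterns, F) with the input patterns
    train_clustering : callable that trains the clustering model (e.g., a SOM) on the
                       reduced patterns and returns cluster labels (placeholder here)
    objectives       : callables mapping (reduced_patterns, labels) to a cost value"""
    selected = np.flatnonzero(mask)              # indices of the selected features
    reduced = patterns[:, selected]              # project the patterns onto the selection
    labels = train_clustering(reduced)           # SOM learning iterations
    return [f(reduced, labels) for f in objectives]   # objective vector f1, ..., fn

# Hypothetical usage with dummy stand-ins for the SOM and the cost functions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 152))                          # 200 patterns, 152 candidate features
mask = rng.integers(0, 2, size=152)
dummy_som = lambda data: rng.integers(0, 4, size=len(data))     # fake cluster labels
count_features = lambda data, labels: data.shape[1]             # objective: number of features
dummy_quality = lambda data, labels: float(data.var())          # objective: placeholder quality
print(evaluate_individual(mask, X, dummy_som, [dummy_quality, count_features]))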

Clustering
Algorithm

Input
patterns

Clustering
Algorithm

xN1,...,xNF

Learning
Iterations

Features selected
(by individual i):
xi1, xi2,.. xiF

Evaluation of objectives f1, f2,..,fn of the


individuals (feature selections)

Evolutionary
operators
+
Selection of
individuals

Figure 2. Parallel procedure with distributed fitness computation for


unsupervised multiobjective feature selection based on SOM (MWE)

In many applications, the pattern components are features extracted from signals measured by different sensors or sources. For example, in EEG classification problems we have NE electrodes, and the EEG signal obtained by each electrode is characterized by NF features. Thus, the number of features to be selected is F = NE x NF, and we arrange these features in such a way that consecutive features correspond to the same source. Thus, the components of any given individual in the population can be denoted as:

Xi = (xi(1,1), ..., xi(1,NF), xi(2,1), ..., xi(2,NF), ..., xi(NE,1), ..., xi(NE,NF))    (1)
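Under this arrangement, the flat position in the individual of the k-th feature of electrode e is simply (e-1)*NF + (k-1) (0-based), as the following lines illustrate with example values for NE and NF:

NE, NF = 8, 16                 # example values: electrodes and features per electrode
F = NE * NF                    # total number of candidate features

def flat_index(e, k):
    """0-based position in the individual of feature k of electrode e (both 1-based, as in Eq. (1))."""
    return (e - 1) * NF + (k - 1)

assert flat_index(1, 1) == 0 and flat_index(2, 1) == NF and flat_index(NE, NF) == F - 1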

Figure 1. Wrapper procedure for unsupervised feature selection by evolutionary multi-objective optimization


Figures 2 and 3 show schemes of the implementations of the parallel procedures described in Algorithm 1 and Algorithm 2 for the feature selection application considered here. In what follows, we describe the alternatives proposed to distribute the search space and to allow cooperation among processes in a feature selection problem.
In a feature selection problem, the evolutionary multi-objective algorithm uses a population of individuals, Population = {X1, X2, ..., XN}, where each individual has as many binary components as there are features, thus Xi = (xi1, xi2, xi3, ..., xiF), where F is the number of features, xij = 1 whenever the j-th feature (j=1,2,..,F) is selected, and xij = 0 otherwise. The space where a subpopulation evolves can be defined by setting some of the components of the individuals to specific constant values (the multi-objective evolutionary algorithm, MOEA, does not change them), while the remaining components can change and define the subspace through which the subpopulation evolves.
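A small sketch of this binary encoding, and of how a subpopulation can be restricted to a subspace by freezing some components, is given below (the mutation operator and the parameter values are illustrative assumptions, not the operators actually used):

import random

F = 512                        # number of candidate features (e.g., benchmark b512)
NF = 16                        # features per signal, assuming consecutive arrangement

def random_individual():
    """Binary individual: x[j] = 1 if the j-th feature is selected."""
    return [random.randint(0, 1) for _ in range(F)]

def mutate_in_subspace(individual, free_positions, rate=0.05):
    """Flip bits only at the positions this subpopulation is allowed to explore;
    all remaining components keep their constant values."""
    for j in free_positions:
        if random.random() < rate:
            individual[j] = 1 - individual[j]
    return individual

# Subpopulation exploring only the features of, say, the third signal (electrode)
free = range(2 * NF, 3 * NF)
individual = mutate_in_subspace(random_individual(), free)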

Figure 3. Concurrent evolution of subpopulations for unsupervised multiobjective feature selection based on SOM

A subpopulation of individuals that keep some of their components set to given values while allowing changes in the other components can be used to explore the corresponding subspace. Thus, if changes are only allowed in the components that correspond to the features of a given sensor or source (for example, an electrode in an EEG), the subpopulation will search possible feature selections for this electrode taking into account the set of features selected for the other sensors. If the values of all the constant components are zero, the evolutionary algorithm searches only in the set of features of this sensor, which is then considered independent from the others.
The way the parallel subpopulations interact, and evolve after this interaction, in Algorithm 2 depends on the characteristics of the function Combine_&_Distribute. Before describing the alternatives proposed in this paper to implement this function, we introduce some notation.
Without loss of generality, it can be supposed that the features whose selection is explored by a given processor through its assigned subpopulation are consecutive. This situation holds, for example, in problems of classification of multivariate signals such as EEGs, or in those that classify signals from several sensors, whenever the components corresponding to features of the same signal (from a given electrode or sensor) are to be explored by the same processor. In this case, the i-th individual of the population can be denoted as:

Xi = (xi(1), xi(2), ..., xi(j), ..., xi(NE))    (2)

where xi(j) = (xi(j,1), ..., xi(j,NF)), NE is the number of signals, and NF is the number of features for each signal. It will be supposed that NE = P (P being the number of processors). If NE > P, more than one signal is allocated to a processor, and if NE < P, several processors can be devoted to a given signal (the NF features of each signal are distributed among several processors). Nevertheless, in both cases the changes in the Combine_&_Distribute function can be easily derived from the description provided here for the case NE = P.

The i-th individual in a subpopulation that only explores the features of the j-th electrode of the EEG can be denoted as:

Xji(Kji) = (Kji(1), Kji(2), ..., xi(j), ..., Kji(NE))    (3)

where Kji(k) = (Kji(k,1), ..., Kji(k,NF)), the components Kji(k,l) are constants (equal to 0 or 1), and Kji is the number whose binary representation is built from all the constant bits in the representation of the individual, denoted (Kji(1), ..., Kji(j-1), Kji(j+1), ..., Kji(NE)). Whenever all the constant components are equal for all the individuals (i=1,...,r, with r=N/P) of the subpopulation, an individual can be denoted as:

Xji(Kj) = (Kj(1), Kj(2), ..., xi(j), ..., Kj(NE))    (4)

where Kj(k) = (Kj(k,1), ..., Kj(k,NF)), the Kj(k,l) are constants (0 or 1), and thus Kj(k) is the number whose binary representation is (Kj(k,1), ..., Kj(k,NF)). This way, the search space for the j-th subpopulation is defined by all the constant bits in the representation of the individuals in this subpopulation (these values are the same for all the individuals in the subpopulation).

Three different alternatives, ALT1 to ALT3, have been considered to combine the subpopulations obtained by the different processors after a given number of independent MOEA iterations. The way they work is explained in what follows.

ALT1: An initial set of P subpopulations is defined. The subpopulation assigned to the j-th processor includes the r=N/P individuals {Xj1(Kj), Xj2(Kj), ..., Xjr(Kj)} (j=1,...,P). This way, the j-th processor (j=1,...,P) receives its corresponding subpopulation j and evolves it, by only modifying the bits of xi(j) = (xi(j,1), ..., xi(j,NF)) in the individuals of the population (i=1,...,r), along a given number of generations. Then each processor obtains a new subpopulation j=1,...,P.

The Combine_&_Distribute function for ALT1 receives the P subpopulations computed by the P different processors. All these subpopulations are combined and a new population of N/P individuals is built from the individuals included in any of the combined P subpopulations. Different alternatives have to be taken into account to build this new population. If there are more non-dominated individuals than the maximum number of individuals allowed in the new population, a procedure that drops the individuals closest to another one is applied [15]. Moreover, if the number of non-dominated individuals is lower than the number of individuals allowed in the combined population, some individuals are randomly selected from the second level of non-dominance (the new non-dominated individuals that appear once the present non-dominated individuals are removed from consideration), and so on. The final population of N/P solutions obtained is sent again to all the processors.

This way, each processor j receives a subpopulation of individuals (yi(1), yi(2), ..., yi(j), ..., yi(NE)) (i=1,...,N/P), where yi(k) = (yi(k,1), ..., yi(k,NF)), although it only modifies the bits (yi(j,1), ..., yi(j,NF)) of the individuals (i=1,...,r=N/P) of its subpopulation. Thus, once a given number of evolutionary iterations have been independently executed by each processor, and after the function Combine_&_Distribute, each processor starts a new set of independent iterations of the multi-objective optimization problem by using the same population of individuals, although different processors explore the space defined by different components, thus obtaining different subpopulations after the independent evolutionary iterations.

An example of the evolution of the explored search space in the case of two subpopulations with individuals Xi = (xi(1,1), ..., xi(1,NF), xi(2,1), ..., xi(2,NF)) is provided in Figure 4. In this figure, the two dimensions correspond to the values of (xi(1,1), ..., xi(1,NF)) and (xi(2,1), ..., xi(2,NF)), respectively, and the initial sets of populations are (xi(1,1), ..., xi(1,NF), k(2,1), ..., k(2,NF)) and (k(1,1), ..., k(1,NF), xi(2,1), ..., xi(2,NF)) with i=1,...,N/2.

Figure 4. Example of combination in ALT1 with two subpopulations: after the first independent iterations and combination (above) and after the second independent iterations (below)
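A schematic version of the ALT1 combination step is sketched below; the dominance test is the one given in Section II, and the closeness-based drop of [15] is replaced by a random drop to keep the sketch short:

import random

def dominates(a, b):
    """Pareto dominance test for minimization (as in Section II)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def combine_alt1(subpopulations, objective_vectors, new_size):
    """Schematic ALT1 combination: merge all subpopulations and fill the new population
    level by level of non-dominance until new_size individuals have been chosen.
    objective_vectors[i] is the objective vector of the i-th individual of the merged list."""
    merged = [ind for sp in subpopulations for ind in sp]
    pool = list(range(len(merged)))
    selected = []
    while pool and len(selected) < new_size:
        front = [i for i in pool
                 if not any(dominates(objective_vectors[j], objective_vectors[i])
                            for j in pool if j != i)]
        if len(selected) + len(front) > new_size:
            # stands in for the closeness-based drop of [15]
            front = random.sample(front, new_size - len(selected))
        selected.extend(front)
        pool = [i for i in pool if i not in front]
    return [merged[i] for i in selected]   # population of N/P solutions sent back to all workers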
The next two alternatives are proposed to take into account the characteristics of applications whose features can be grouped into bundles corresponding to different signals (such as a sensor, or an electrode in an EEG classification problem), so that feature selection can be done almost independently for each signal.
ALT2: In this alternative, the j-th processor receives the individuals Xij(K) = (K(1), K(2), ..., xi(j), ..., K(NE)) (i=1,...,N/P). This means that the individuals defining subpopulation j only explore the features that correspond to the j-th signal, (xi(j,1), ..., xi(j,NF)), while the rest of the components are set to constant values which (in each position) are the same for all the individuals in the subpopulation.

In the function Combine_&_Distribute, the non-dominated solutions obtained by all the processors are taken into account and define a population of N individuals, (yi(j,1), ..., yi(j,NF)) (i=1,...,N/P, j=1,...,P). Once a processor j receives the population of N individuals, it evaluates the individuals and selects a subpopulation of N/P individuals, (K(1), K(2), ..., yi(j), ..., K(NE)) (i=1,...,N/P), to proceed with the next independent iterations. This way, the j-th processor continues improving its subpopulation of N/P individuals (yi(j,1), ..., yi(j,NF)), i=1,...,N/P, with the values K(k) (k=1,...,j-1,j+1,...,P) fixed for all the individuals across the next independent iterations.
ALT3: As in ALT2, the j-th processor receives a subpopulation of individuals Xij(K) = (K(1), K(2), ..., xi(j), ..., K(NE)) (i=1,...,N/P) and only explores xi(j) = (xi(j,1), ..., xi(j,NF)), while keeping constant the values of the rest of the components, which are also the same for all the individuals in the population. This way, K = (K(1), ..., K(j-1), K(j+1), ..., K(NE)) for all i=1,...,N/P and j=1,...,P. The function Combine_&_Distribute in this alternative generates a new subpopulation for each processor j by choosing the non-dominated individuals of its corresponding subpopulation, and it also includes a constant number, q, of non-dominated solutions selected from the chunks explored by the other processors k ≠ j, xs(k) = (xs(k,1), ..., xs(k,NF)) (s=1,...,q; k=1,...,j-1,j+1,...,P). Once a processor j receives its population of N/P individuals, it continues evolving the subpopulation across several independent iterations to improve it.
The feature selection problem for multivariate signals
where these signals are independent can be implemented by
either ALT2 or ALT3, whenever K(k)=0 is set for all k=1,..,P.
Moreover, ALT2 can be considered a special case of ALT3
where instead of selecting a number of non-dominated
individuals from other populations, the whole set of solutions
in the rest of subpopulations is considered.
Figure 5 provides a scheme corresponding to the function
Combine_&_Distribute for ALT2 and two subpopulations,
similar to that provided in Figure 4 for ALT1. In the case of
ALT3 the corresponding figure would be similar to Figure 5
although not all the solutions found by one subpopulation are
transferred to the others.
Figure 5. Example of combination in ALT2 with two subpopulations: after the first independent iterations and combination (above) and after the first combination step in P2 (below)

IV. EXPERIMENTAL RESULTS


In this section, the proposed parallel procedures are evaluated by using a set of benchmarks. The computer used to execute the codes is a cluster whose nodes include two Intel Xeon E5520 processors (4 cores and 2 threads per core) at 2.7 GHz and 16 GB of RAM, connected by Infiniband. As this paper deals with the parallel implementation of feature selection as a multi-objective optimization problem, and due to space limitations, we do not provide a detailed comparison of our multi-objective approach with other previously proposed procedures, which would require a high number of sufficiently representative benchmarks. With respect to this issue, we only give experimental results that demonstrate the effectiveness of our approach in selecting adequate sets of features. Then, we analyse the quality and speedup results for the different parallel alternatives proposed here.
The experiments have been carried out by using several synthetic benchmarks (available upon request to the authors) and the 2D motion dataset provided at [19] for a BCI motor imagery application. They correspond to classification problems with a high number of features, respectively 152, 384, and 512 features for the benchmarks b152, b384 and b512. In the experiments, the classifier used in our proposed feature selection procedures relies on a self-organizing map (SOM) [20] for clustering, and its performance has been evaluated by two cost functions that thus define a multi-objective optimization problem. For feature selection in a supervised classification problem, as the labels of the patterns are known, the classification error on the set of test patterns has been used as one of the cost functions, and the number of selected features as the second. Whenever the patterns are not labelled, an unsupervised feature selection procedure has been implemented where the two cost functions are based on the properties of the obtained clustering. More specifically, these cost functions take into account the distances from each vector to its nearest one: the smaller these distances are among vectors that belong together and the larger they are between vectors that do not, the better the clustering.
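One possible reading of these two unsupervised objectives, based on nearest-neighbour distances within and between clusters, is sketched below (this is our illustrative interpretation, not necessarily the exact formulas used in the experiments):

import numpy as np

def nearest_neighbour_costs(points, labels):
    """One possible pair of unsupervised objectives (both returned as values to minimize):
    the mean distance from each point to its nearest neighbour in the same cluster, and
    minus the mean distance from each point to its nearest neighbour in any other cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(d, np.inf)
    same = labels[:, None] == labels[None, :]
    intra = np.where(same, d, np.inf).min(axis=1)      # nearest neighbour in the same cluster
    inter = np.where(~same, d, np.inf).min(axis=1)     # nearest neighbour in another cluster
    return (float(np.mean(intra[np.isfinite(intra)])),
            -float(np.mean(inter[np.isfinite(inter)])))

# Hypothetical usage on random data with random cluster labels
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.integers(0, 3, size=30)
print(nearest_neighbour_costs(X, y))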
TABLE II. PERFORMANCE ACHIEVED BY DIFFERENT FEATURE SELECTION AND CLASSIFICATION METHODS (FOR BENCHMARK B512)

Table II compares, on the benchmark b512, the multi-objective feature selection procedures proposed here, Supervised MO and Unsupervised MO (in bold in Table II), with other approaches for feature selection such as Backward FS [21], ReliefF [24], and PCA [21], along with SOM and SVM [22] as clustering/classification methods. In the table, (i)/k refers to the i-th non-dominated Pareto solution, corresponding to k selected features. Although the benchmark b512 corresponds to a problem with many features (512) and a low number of patterns (200), it is not a very difficult classification problem, as our purpose was to check whether our methods are able to select the features that provide the best classification results. The comparison has been done by using accuracy measures for each class (C1 to C4) [21], corresponding to each of the four classes in b512, and the kappa coefficient [23], which provides a performance evaluation of the classifier that takes into account the per-class error distribution. As higher values of these measures mean better performance, it is clear from Table II that our proposed multi-objective feature selection procedures are competitive with the other state-of-the-art procedures.

Table III provides the averages of the hypervolume metric (higher is better) for 2, 4, 6, and 8 processors, obtained after the experiments with the unsupervised classification alternative on the benchmark b152, derived from [19], and the synthetic benchmarks b384 and b512. The reference points to evaluate the hypervolumes have been obtained from the estimated worst bounds for the cost functions in the different benchmarks. Bold values in Table III correspond to the cases where better qualities are obtained by the parallel implementation than by the sequential one. Each experiment with a different parallelization alternative, benchmark, and number of processors has been repeated 15 times. In the sequential case, the hypervolume averages were, respectively, 157.13 for b152, 34.44 for b384, and 70.41 for b512. The average execution times for b152 and b384 are, respectively, 3756.53 and 6888.68 seconds. As can be seen, all the alternatives reach quality measures within 10% of the average quality obtained by the sequential alternative, except in the case of b152 and ALT3 with 2 processors (12.83% worse than the sequential alternative). Moreover, ALT1 achieves better average qualities than the sequential executions with 2 and 4 processors, and MWE does so with 2, 4, 6 and 8 processors on the benchmarks b384 and b512.

TABLE III. PERCENTAGES OF HYPERVOLUME DEVIATIONS OF THE PARALLEL ALTERNATIVES WITH RESPECT TO THE SEQUENTIAL ONE (UNSUPERVISED CLASSIFICATION): MWE (MASTER-WORKER EVALUATION) AND ALT1, ALT2, AND ALT3 (PARALLEL SUBPOPULATION EVOLUTION).

The processing times for the different alternatives correspond to a similar number of evaluations in all of them. Figure 6 shows the average efficiencies for the different benchmarks and numbers of processors. As can be seen, the alternatives ALT2 and ALT3 obtain superlinear values for 4, 6, and 8 processors in all the benchmarks evaluated, and even for 2 processors in the b512 benchmark. From Figure 6, it can be concluded that the speedups obtained by the different parallel alternatives seem to increase with the number of features in the benchmarks. Moreover, taking into account the quality of the Pareto fronts derived from Table III, a trade-off arises between solution quality and speedup with respect to the sequential alternative (better acceleration at the cost of a decrease in the quality of the solutions).

Figure 6. Efficiency averages for different benchmarks, numbers of processors, and parallel alternatives: MWE (master-worker evaluation) and ALT1, ALT2, and ALT3 (parallel subpopulation evolution).

Figure 7 shows some of the Pareto fronts obtained by the different alternatives. It mixes some of the best and worst approximations for the different parallelization alternatives on benchmark b152 (similar behaviour is observed on the other benchmarks). As can be seen, the obtained approximated Pareto fronts are quite similar and intermixed. Nevertheless, the best approximated Pareto front obtained by ALT1 seems to be one of the best results.

Figure 7. Best (B) and worst (W) Pareto fronts obtained by the different parallel alternatives (SQ: sequential, MWE, ALT1, ALT2, and ALT3) in different runs for benchmark b152.

The statistical significance of the results has been analysed by applying a Kolmogorov-Smirnov test, whose results indicate that the quality measures do not follow normal distributions (Figure 8 shows the corresponding results for the benchmark b152; the same behaviour has been observed in the other benchmarks). Thus, we have applied a Kruskal-Wallis analysis instead of an ANOVA test. The results show that only the differences between the sequential alternative and ALT2 and ALT3 are statistically significant (p=0.0011 and p=0.0014 for b152). Thus, the slightly worse quality results obtained by ALT2 and ALT3 with respect to the sequential and ALT1 alternatives are significant. The ALT1 option does not provide results statistically different from the sequential case (p=0.2364 for b152), and something similar happens with the MWE case (p=0.0902 for b152). MWE behaves as expected, because it only distributes the evaluation of the solutions; in the case of ALT1, the statistical results show that although ALT1 behaves differently from the sequential case, it can provide similar solution qualities.

Figure 8. Plots for normal probability testing of quality results in the experiments with MWE, ALT1, ALT2, and ALT3 (for b152).
V. CONCLUSIONS
Parallel implementations of a wrapper procedure based on multi-objective optimization for feature selection in supervised and unsupervised classification have been described and evaluated by using different synthetic benchmarks and benchmarks for EEG signal classification in BCI applications.

Four parallelization alternatives of an evolutionary multi-objective procedure based on NSGA-II have been considered. The MWE alternative corresponds to a model of concurrent evaluation of the individual fitness with a master-worker implementation, while the ALT1, ALT2, and ALT3 models distribute the multi-objective evolutionary optimization among subpopulations that evolve independently but communicate with each other after some iterations.

Although, depending on the benchmark considered, the quality of the solutions has been improved with respect to the sequential executions in some cases of MWE (with 2, 4, 6 and 8 processors) and ALT1 (with 2 and 4 processors), these differences cannot be considered statistically significant. In the remaining cases, worse quality results have been obtained by the parallel alternatives, although the deviations have been under 10% in all but one case (ALT3 with 2 processors in one of the benchmarks). With respect to the speedups, some speedup has been achieved in all cases, although the highest efficiencies have been obtained by ALT2 and ALT3, where even superlinear speedups have been obtained for 4, 6 and 8 processors. Thus, there seems to be a trade-off between speed and solution quality: it is possible to obtain good speedups whenever some reduction in the solution quality is allowed (and in this case ALT2 or ALT3 could be used). Nevertheless, ALT1 can be considered a good option, as it provides speedups and solutions of good enough quality. Indeed, as has been said, the statistical analysis shows that ALT1 and the sequential version can be considered similar with respect to solution quality.

The study of new alternatives for the function Combine_&_Distribute and wider experimentation with a larger set of high-dimensional feature selection benchmarks in EEG classification and bioinformatics are important issues for our future work. Nevertheless, many other issues in multi-objective clustering, mainly related to the definition of the cost functions that evaluate the clustering quality, also constitute important research areas.

Acknowledgment
This work has been funded by projects TIN2012-32039 (Spanish Ministerio de Economía y Competitividad and FEDER funds) and P11-TIC-7983 (Junta de Andalucía). The authors would like to thank the reviewers for their comments and suggestions to improve the paper.

References
[1] Y. Saeys, I. Inza, P. Larrañaga: "A review of feature selection techniques in bioinformatics". Bioinformatics, Vol. 23, No. 19, pp. 2507-2517, 2007.
[2] C. Sima, E. Dougherty: "What should be expected from feature selection in small-sample settings". Bioinformatics, Vol. 22, pp. 2430-2436, 2006.
[3] N. Acir, C. Güzeliş: "An Application of Support Vector Machine in Bioinformatics: Automated Recognition of Epileptiform Patterns in EEG Using SVM Classifier Designed by a Perturbation Method". Advances in Information Systems, LNCS Vol. 3261, pp. 462-471, 2005.
[4] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, B. Arnaldi: "A Review of Classification Algorithms for EEG-based Brain-Computer Interfaces". Journal of Neural Engineering, 4, 2007.
[5] S.J. Raudys, A.K. Jain: "Small sample size effects in statistical pattern recognition: Recommendations for practitioners". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 3, pp. 252-264, 1991.
[6] J. Handl, J. Knowles: "Feature selection in unsupervised learning via multi-objective optimization". International Journal of Computational Intelligence Research, Vol. 2, No. 3, pp. 217-238, 2006.
[7] C. Emmanouilidis, A. Hunter, J. MacIntyre: "A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator". In Proceedings of the 2000 Congress on Evolutionary Computation, IEEE Press, New York, NY, pp. 309-316, 2000.
[8] L.S. Oliveira, R. Sabourin, F. Bortolozzi, C.Y. Suen: "A methodology for feature selection using multiobjective genetic algorithms for handwritten digit string recognition". International Journal of Pattern Recognition and Artificial Intelligence, 17(6), pp. 903-929, 2003.
[9] Y. Kim, W.N. Street, F. Menczer: "Evolutionary model selection in unsupervised learning". Intelligent Data Analysis, 6(6), pp. 531-556, 2002.
[10] M. Morita, R. Sabourin, F. Bortolozzi, C.Y. Suen: "Unsupervised feature selection using multi-objective genetic algorithms for handwritten word recognition". In Proceedings of the Seventh International Conference on Document Analysis and Recognition, IEEE Press, New York, NY, pp. 666-671, 2003.
[11] D.L. Davies, D.W. Bouldin: "A cluster separation measure". IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, pp. 224-227, 1979.
[12] K. Deb, P. Zope, A. Jain: "Distributed computing of Pareto-optimal solutions using multi-objective evolutionary algorithms". In Proc. of the Second Evolutionary Multi-Criterion Optimization (EMO-03) Conference, LNCS 2632, pp. 535-549, Springer, 2003.
[13] J. Branke, H. Schmeck, K. Deb, R.S. Maheshwar: "Parallelizing multiobjective evolutionary algorithms: cone separation". In Proc. of the Congress on Evolutionary Computation, pp. 1952-1957, IEEE Press, New York, 2004.
[14] T. Hiroyasu, M. Miki, S. Watanabe: "The new model of parallel genetic algorithm in multiobjective optimization problems - divided range multiobjective genetic algorithm". In Proc. of the Congress on Evolutionary Computation, pp. 333-340, IEEE Press, New York, 2000.
[15] M. Cámara, F. de Toro, J. Ortega: "An Analysis of Multiobjective Evolutionary Algorithms for Optimization Problems with Time Constraints". Applied Artificial Intelligence, Vol. 27, No. 9, pp. 851-879, 2013.
[16] F. Streicher, H. Ulmer, A. Zell: "Parallelization of multi-objective evolutionary algorithms using clustering algorithms". In Proc. of the Third Evolutionary Multi-Criterion Optimization (EMO-05) Conference, LNCS 3410, pp. 92-107, Springer, 2005.
[17] M. Cámara, J. Ortega, F. de Toro: "Comparison of Frameworks for Parallel Multiobjective Evolutionary Optimization in Dynamic Problems". In (F. Fernández de Vega, J.I. Hidalgo, J. Lanchares, Eds.) Parallel Architectures and Bioinspired Algorithms, Studies in Computational Intelligence, Vol. 415, pp. 101-123, 2012.
[18] L.T. Bui, H.A. Abbass, D. Essam: "Local models - an approach to distributed multi-objective optimization". Computational Optimization and Applications, Vol. 42, pp. 105-139, 2009.
[19] https://sites.google.com/site/projectbci/
[20] T. Kohonen: "Self-Organizing Maps". Springer, 2001.
[21] S. Theodoridis, K. Koutroumbas: "Pattern Recognition". Academic Press, 2009.
[22] V.N. Vapnik: "Statistical Learning Theory". Wiley-Interscience, 1998.
[23] J. Cohen: "A coefficient of agreement for nominal scales". Educational and Psychological Measurement, Vol. 20, pp. 37-46, 1960.
[24] M. Robnik-Šikonja, I. Kononenko: "Theoretical and empirical analysis of ReliefF and RReliefF". Machine Learning, Vol. 53, pp. 23-69, 2003.
[25] K. Deb, S. Agrawal, A. Pratab, T. Meyarivan: "A fast elitist Non-dominated Sorting Genetic Algorithm for multi-objective optimisation: NSGA-II". Proc. of the 6th Int. Conference on Parallel Problem Solving from Nature (PPSN VI), LNCS 1917, pp. 849-858, Springer-Verlag, 2000.
