PII: S0957-4174(17)30447-5
DOI: 10.1016/j.eswa.2017.06.030
Reference: ESWA 11401
Please cite this article as: Mohammed Aladeemy, Salih Tutun, Mohammad T. Khasawneh, A new hybrid approach for feature selection and Support Vector Machine model selection based on Self-Adaptive Cohort Intelligence, Expert Systems With Applications (2017), doi: 10.1016/j.eswa.2017.06.030
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
• The proposed algorithm employs a self-adaptive scheme and a mutation operator.
• A new hybrid approach for feature selection and SVM model selection is proposed.
State University of New York at Binghamton
Department of Systems Science and Industrial Engineering
Binghamton, NY 13902, USA
Abstract
This paper proposes a new hybrid approach for feature selection and Support Vector Machine (SVM) model selection based on a new variation of the Cohort Intelligence (CI) algorithm. Feature selection can improve the accuracy of classification algorithms and reduce their computation complexity by removing irrelevant and redundant features. SVM is a classification algorithm that has been used in many areas, such as bioinformatics and pattern recognition. However, the classification accuracy of SVM depends mainly on tuning its hyperparameters (i.e., SVM model selection). This paper presents a new CI variation, referred to as Self-Adaptive Cohort Intelligence (SACI), whose performance is evaluated on ten benchmark datasets from the literature and compared with those of CI and four other well-known metaheuristics, namely Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), and Artificial Bee Colony (ABC). The comparative results show that SACI outperformed CI and is comparable to or better than the other compared metaheuristics in terms of SVM classification accuracy and dimensionality reduction.
Keywords: Feature Selection; SVM; Classification; Cohort Intelligence; Metaheuristic
1. Introduction
Supervised classification has been utilized in many different areas, such as disease diagnosis, investment risk assessment, and customer profiling. In classification problems, a decision function is defined using training instances (data points or samples) and its output is the predicted labels (classes) of these instances (Kanamori et al., 2013). Then, the defined decision function is used to predict the labels of new (unseen) instances.
In practice, datasets may contain a large number of features, some of which are irrelevant or redundant; therefore, not all of them are crucially significant for prediction (Lin et al., 2008). Moreover, analyzing a dataset with a large number of features is computationally expensive and can degrade the performance of the classification algorithm (Zhu et al., 2013; Houari et al., 2016). In this case, dimensionality reduction becomes a fundamental step, one that can be achieved by feature selection strategies (Houari et al., 2016; Zhu et al., 2013). Feature selection is the process of selecting a smaller feature subset that is sufficient to accurately predict the class labels of the training set.
θ ∈ R^d, that maps X into the target (class) vector, y ∈ R^n (i.e., learning). The learning problem can be defined as an optimization problem with the objective of finding the optimal parameter vector, θ*, that minimizes the following loss function (Katrutsa & Strijov, 2017):

    θ* = argmin_θ L(θ | S, X, y)                                          (1)
where L is the loss function that evaluates the quality of θ and S ⊆ J is a feature subset index. In supervised classification, it is assumed that the features include noise and some of them are irrelevant, which introduces additional error in the estimation of θ*. Thus, feature selection aims at removing these noisy and irrelevant features to reduce the computation complexity by reducing the dimensionality of the problem (Ji et al., 2017) and to improve the performance of the classification algorithm (Lin et al., 2015). Katrutsa & Strijov (2017) noted that the filter approach measures the relevance between the features and the target vector and does not require the estimation of θ* in (1). The optimal feature subset, S*, can be attained using an exhaustive search over all the 2^d subsets of J, and the computational complexity of evaluating all these subsets is O(2^d) · O(J) (Kira & Rendell, 1992). That is, the problem's complexity increases exponentially as the number of features increases, making the problem computationally intractable when d is large. Therefore, heuristic approaches are widely used to find a near-optimal feature subset within a reasonable time, which is one of the motivations of this research.
One of the most commonly used classification algorithms is Support Vector Machines (SVM) (Couellan et al., 2015). The popularity of SVM comes from its powerful computational capabilities for supervised learning (Xue et al., 2010) and its generalization properties (Alham et al., 2011). SVM was initially proposed for binary classification problems and was later extended to multi-class classification problems using the “one against one” or “one against all” strategies, as discussed in Milgram et al. (2006).
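As an illustration of these two decomposition strategies, the sketch below trains a multi-class SVM under both schemes. The use of scikit-learn and its class names is our assumption for illustration; the paper does not name any particular library.

```python
# Hypothetical sketch (ours, not the paper's code): multi-class SVM via
# "one against one" and "one against all" decomposition with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# "One against one": one binary SVM per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X, y)

# "One against all": one binary SVM per class vs. all remaining classes.
ova = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X, y)

print(ovo.score(X, y), ova.score(X, y))
```

For K classes, the first scheme trains K(K − 1)/2 binary classifiers, whereas the second trains only K, which is the usual trade-off between the two strategies.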
The SVM model is trained to generate the class labels y of the training set X = [x1, x2, · · · , xn], where xi ∈ R^d. Training SVM implies finding the optimal linear separating hyperplane with the lowest risk of misclassifying future instances (maximum margin). This can be achieved by solving the following optimization problem:

    argmin_{w,b,ζ}  (1/2)‖w‖² + C Σ_{i=1}^{n} ζi
    s.t.  ζi + yi(w · xi + b) − 1 ≥ 0,  ∀i                                (3)
          ζi ≥ 0,  ∀i
where yi ∈ {−1, +1} is the class label of data instance xi, whereas w and b are the weight vector and the intercept (bias), respectively. The slack variable ζi represents the distance between a misclassified data instance xi and the corresponding margin boundary of its class. Lagrange multipliers can be used to solve the optimization problem given in (3), where its dual representation, which must be maximized with respect to the Lagrange multipliers αi, can be represented by (Bishop, 2006):
    argmax_α  Σ_{i=1}^{n} αi − (1/2) Σ_{i,j=1}^{n} αi αj yi yj k(xi, xj)
    s.t.  0 ≤ αi ≤ C,  ∀i                                                 (4)
          Σ_{i=1}^{n} αi yi = 0
where k(xi, xj) is the kernel function that allows the maximum margin model to be efficiently applied in a higher dimensional feature space (Bishop, 2006), as the data instances are not linearly separable in most classification problems. The Radial Basis Function (RBF) is the most commonly used kernel due to its capability to take almost any decision boundary shape to classify the data instances (Devos et al., 2014). In this paper, the RBF kernel is adopted, which is given by

    k(xi, xj) = exp(−γ‖xi − xj‖²)                                         (5)
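For concreteness, Eq. (5) can be evaluated directly. The following minimal sketch (ours, with an arbitrary γ) computes the RBF kernel for a pair of instances:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.5):
    """RBF kernel of Eq. (5): k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))              # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))   # exp(-0.1 * 25)
```

The value decays from 1 (identical instances) toward 0 as the squared distance grows, with γ controlling how quickly the decision boundary can bend.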
To improve the classification accuracy of SVM with the RBF kernel, the regularization parameter C and the kernel parameter γ need to be tuned (i.e., SVM model selection). Previous studies found that simultaneous SVM model selection and feature selection is crucial due to the interdependency between the selected feature subset and the tuned SVM hyperparameters. Therefore, this research aims at developing a new hybrid approach for simultaneous feature selection and SVM model selection. The new approach is based on the integration of SVM and a proposed variation of Cohort Intelligence (CI).
CI is a new metaheuristic algorithm proposed by Kulkarni et al. (2013) that is inspired by the tendency, in both nature and society, to commune with and learn from one another. The local search (exploitation) in this algorithm is implemented by allowing each candidate to control its behavior (fitness value) through sampling qualities (solutions) from its sampling interval. The global search (exploration) is implemented by allowing each candidate to observe the behaviors of other candidates to find the best behavior to follow. Each candidate performs both exploitation and exploration in each learning attempt (cycle or iteration) to improve the overall behavior of the entire cohort. This procedure is repeated until all candidates have a similar quality of behavior for a considerable number of successive learning attempts (i.e., cohort saturation), which means that no further improvement can be achieved. CI has been applied to several applications, including unconstrained optimization problems (Kulkarni et al., 2013) and assignment problems (Kulkarni & Shabir, 2016). Krishnasamy et al. (2014) found that integrating a modified version of CI with the k-means algorithm results in high-quality solutions for the data clustering problem.

This paper presents a new hybrid approach for simultaneous feature selection and SVM model selection in which a new CI variation is integrated with SVM. Accordingly, the main contributions of this research can be outlined as follows:
SACI directly applicable to the feature selection problem. Similarly, to make CI directly applicable to the feature selection problem, one-bit flip mutation is also introduced to the behavior sampling step (i.e., a binary version of CI). The motivations for these modifications are discussed in Section 3.2.
2. SACI is applied to both feature selection (i.e., in the binary domain) and SVM model selection (i.e., in the continuous domain) simultaneously, by integrating it with SVM, resulting in a new hybrid approach (SVM-SACI) for binary (two-class) and multi-class classification problems.
Section 4 presents the new hybrid SVM-SACI approach for simultaneous feature selection and SVM model selection. Section 5 presents the numerical experiments and discussions, whereas Section 6 concludes this paper and outlines future work.
Feature selection has been extensively studied in the literature (Guyon et al., 2008). Since the early 1970s, many studies have addressed dimensionality reduction using various methods (Moradi & Rostami, 2015). There are two main approaches to feature selection: filter approaches and wrapper approaches (Sánchez-Maroño et al., 2007). Filter approaches calculate the relevance score of each feature regardless of the learning algorithm. Although these approaches are computationally efficient, it is difficult to attain an optimal feature subset using them because each feature is evaluated without considering the performance of the classification algorithm (Lin et al., 2008). On the other hand, wrapper approaches can find an optimal feature subset that increases the classification accuracy by assessing the quality of every selected feature subset using learning algorithms (Moradi & Rostami, 2015; Zorarpacı & Özel, 2016). However, wrapper approaches require higher computational effort compared to filter approaches (Ji et al., 2017). Thus, several studies proposed hybrid approaches that combine filter and wrapper methods to provide a trade-off between the two, as in Kira & Rendell (1992) and Chen et al. (2009). The reader is referred to Li et al. (2005) for a comprehensive review of feature selection from a data perspective.
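The contrast between the two approaches can be sketched as follows; the dataset, scoring function, and candidate subset are our illustrative choices, not the paper's:

```python
# Illustrative contrast (our toy example): a filter scores each feature
# independently of any classifier, while a wrapper scores a candidate
# subset by the accuracy of the learning algorithm itself.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Filter: ANOVA F-score per feature, classifier-agnostic and cheap.
filter_scores, _ = f_classif(X, y)
print(np.argsort(filter_scores)[::-1])  # features ranked by relevance

# Wrapper: evaluate a specific subset with the classifier (more costly).
subset = [2, 3]  # an example subset: petal length and petal width
wrapper_score = cross_val_score(SVC(kernel="rbf"), X[:, subset], y, cv=5).mean()
print(wrapper_score)
```

The filter ranking is computed once per feature, whereas the wrapper cost grows with the number of subsets evaluated, which is why wrappers are usually paired with a search heuristic.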
Due to the NP-hardness of the feature selection problem (Guyon et al., 2008), other research works used heuristic approaches to find a near-optimal feature subset. Martins et al. (2014) proposed a multi-objective GA for gait feature selection for walker-assisted gait classification. Avci (2009) used a hybrid GA for feature selection for digital modulation classification. Güraksın et al. (2014) used PSO for feature selection in the training process of a bone-age determination application. Khokhar et al. (2017) used Artificial Bee Colony (ABC) for both feature selection and adjustment of the spread constant of a Probabilistic Neural Network (PNN) classifier simultaneously. Inbarani et al. (2015) used an improved harmony search method for selecting the best feature subset. Emary et al. (2016) applied Grey Wolf Optimization (GWO) to feature selection. Alshamlan et al. (2015) used a genetic bee colony algorithm to select the best features. On the other hand, Bennasar et al. (2015) presented Joint Mutual Information Maximization (JMIM) and Normalized Joint Mutual Information Maximization (NJMIM) for feature selection. López & Maldonado (2017) proposed a group-penalized feature selection procedure that aims at removing features in both twin hyperplanes of Twin SVM (TWSVM).
The SVM model is a well-known classifier (Couellan et al., 2015). Its performance depends mainly on tuning its hyperparameters, which is an active research line (Zhang et al., 2015). In addition, a large number of features incurs expensive computational complexity and can reduce the SVM classification accuracy (Katrutsa & Strijov, 2017). Therefore, various methods have been proposed in the literature for SVM model selection, including direct selection, numerical, and non-numerical methods. An example of direct selection methods is grid search using Cross Validation (CV) (Hsu et al., 2003), which is commonly used due to its simplicity. However, these approaches are inefficient since the quality of the solutions is not used during the search process (Ji et al., 2017). On the other hand, numerical methods have been employed for SVM model selection, such as the Gradient Descent method (Chapelle et al., 2002). However, gradient-based methods suffer from fast convergence to a local optimum due to the non-convexity of the generalization bounds, and from sensitivity to the starting points (Zhang et al., 2015). Therefore, metaheuristics (non-numerical optimization methods) have been widely employed for SVM model selection, mainly due to their global search capabilities. Moreover, metaheuristics can provide near-optimal solutions within a reasonable time without additional information about the problem.
Lorena & De Carvalho (2008) combined GA with SVM for multi-class classification problems, whereas Sarafrazi & Nezamabadi-pour (2013) used the Gravitational Search Algorithm (GSA) for binary classification. Jiang et al. (2014) used an Improved Adaptive Genetic Algorithm (IAGA), whereas Chou et al. (2014) proposed a Fast Messy GA (FMGA) for SVM model selection. Samma et al. (2016) used a Memetic Algorithm (MA) for SVM model selection. Gao & Hou (2016) applied Principal Component Analysis (PCA) for dimensionality reduction, and then grid search and PSO were used to optimize the SVM hyperparameters. Zhang et al. (2015) proposed first reducing the search space using the Inter-Cluster Distance in Feature Space (ICDF) measure and then using Bare Bones Differential Evolution (BBDE) to optimize the SVM hyperparameters. Czarnecki et al. (2015) proposed Bayesian and Random Search (RS) optimization for robust SVM model selection. Claesen et al. (2015) presented a bagging framework in which bootstrap resampling was used to train an ensemble of SVM models to increase the robustness of the approach against noise in the class labels (e.g., unlabeled data instances). In this bagging strategy, the SVM hyperparameters were tuned using cross validation with grid search. Robust chance-constrained SVM has been formulated as Second-Order Cone Programming (SOCP) models for data with uncertainty without distributional assumptions (Wang et al., 2015). However, solving these models requires high computational effort in practical problems, as it involves processing large-scale datasets (Wang et al., 2015). Therefore, numerical methods, such as the stochastic sub-gradient descent method, have been used to solve robust chance-constrained SVM on large-scale datasets, as in Wang et al. (2015).
Most of these research works focused on either feature selection or SVM model selection using different approaches. Although the interdependency between the selected features and the tuned SVM hyperparameters is crucial (Frohlich et al., 2003; Maldonado et al., 2017), simultaneous feature selection and SVM model selection has received less attention in the literature. Huang & Wang (2006) proposed a GA-based approach for simultaneous feature selection and SVM model selection. Similarly, Lin et al. (2015) proposed a PSO-SVM approach for simultaneous feature selection and SVM model selection. Ji et al. (2017) used an Ensemble Kalman Filter (EnKF) for SVM model selection and the selection of feature weights. Lin et al. (2015) used a recently introduced metaheuristic known as Cat Swarm Optimization (CSO) for simultaneous feature selection and SVM model selection. Maldonado et al. (2017) extended LP formulations of SVM, namely l1-SVM and LP-SVM, to new mixed-integer models and performed cost-based feature selection by considering the variable acquisition cost. This approach was specifically designed for the credit scoring application.

Similar to the objectives of Huang & Wang (2006), Lin et al. (2008), and Lin et al. (2015), this research aims at addressing the problems of feature selection and SVM model selection simultaneously. More specifically, a new variation of Cohort Intelligence (CI) is proposed and integrated with SVM, resulting in a new hybrid approach for simultaneous feature selection and SVM model selection. CI is a real-coded algorithm that was first applied to benchmark unconstrained optimization problems, such as the Rosenbrock, Sphere, and Ackley functions (Kulkarni et al., 2013). More recently, Krishnasamy et al. (2014) modified CI by introducing the mutation operator to overcome its premature convergence. The resulting Modified Cohort Intelligence (MCI) was integrated with the k-means algorithm and applied to clustering problems. Kulkarni et al. (2016) applied CI to three combinatorial optimization problems in which the objective is to find the permutation/rearrangement vector that minimizes the objective function. In addition, CI was applied to solve 0-1 knapsack problems (Kulkarni & Shabir, 2016). However, to the best of the authors' knowledge, this is the first attempt to apply CI or a CI variation to feature selection or SVM model selection. The research methodology used in this paper is presented in Section 3.
3. Research Methodology

The framework of the research methodology used in this paper is illustrated in Figure 1. A new CI variation, SACI, is integrated with SVM to form the proposed hybrid approach, SVM-SACI, for simultaneous feature selection and SVM model selection. The performance of SVM-SACI for simultaneous feature selection and SVM model selection is examined using ten benchmark datasets from the literature and compared with those of five other approaches. More specifically, CI and four well-known metaheuristics, namely Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), and Artificial Bee Colony (ABC), are integrated with SVM to form the hybrid approaches SVM-CI, SVM-GA, SVM-PSO, SVM-DE, and SVM-ABC, respectively. The comparative results are based on the average SVM classification accuracy on the testing set and dimensionality reduction over 20 independent runs. CI and its proposed variation, SACI, are described in Section 3.1 and Section 3.2, respectively.
f(φ^s)) of an individual candidate in the cohort (s = 1, · · · , S). Each candidate selects a candidate's behavior to follow using roulette selection. Then, each candidate tries to improve its behavior by sampling Q qualities from its sampling interval, ψ_p, which is updated in each learning attempt, where φ_p represents the pth parameter (features and SVM hyperparameters are collectively referred to as parameters in this paper). The improvement in the candidate's behavior over learning attempts, l = {1, · · · , L}, implies maximizing f(φ^s) (for maximization problems). The pseudocode of the CI algorithm for maximization problems is given in Algorithm 1.
AN
249
Tournament-
Cohort Self-Adaptive
Based
Intelligence (CI) Scheme
Mutation
ED
Algorithms/
Operators
Self-Adaptive
PT
Support Vector
Cohort Intelligence
Machine (SVM)
(SACI)
CE
SVM-SACI
Approach
AC
SVM
Feature
Applications Selection
Hyperparameters
Optimization
13
ACCEPTED MANUSCRIPT
T
Algorithm 1 CI algorithm (for maximization problems)
Input:
S: no. of candidates
Q: no. of quality variations
r: sampling interval reduction factor
L: max. no. of learning attempts
ε: convergence tolerance
τmax: max. no. of successive learning attempts for saturation
Procedure:
1: Set τ = 1
2: Generate random candidates Φ = {φ^1, · · · , φ^S}
3: Evaluate initial candidates F(Φ) = {f(φ^1), · · · , f(φ^S)}
4: for l = 1 : L do
5:   for s = 1 : S do
6:     Select a candidate's behavior to follow with probability p_s^l = f(φ^s) / Σ_{s=1}^{S} f(φ^s), using roulette selection
7:     Shrink the sampling interval of each parameter φ_p^s to its neighborhood: ψ_p^s ∈ [φ_p^s − r × ‖ψ_p‖/2, φ_p^s + r × ‖ψ_p‖/2]
8:     Sample Q qualities from the updated neighborhood ψ_p^s for each parameter φ_p^s
9:     Evaluate the Q sampled qualities and select the best behavior to update the current behavior of candidate s
10:  end for
The candidates tend to follow each other during the search process (Krishnasamy et al., 2014), which means that the diversity of the solutions in the exploration process is limited.
2. The sampling interval in CI is not adaptive and requires determining the sampling interval reduction factor r a priori.
3. CI, as proposed by Kulkarni et al. (2013), is a real-coded algorithm and, hence, is not directly applicable to binary optimization problems, such as feature selection, because the exploitation process is designed to sample qualities from a continuous domain.

In this research, a binary version of CI is presented by modifying the representation of the quality of the candidate's behavior and the quality sampling step to make CI applicable to the binary domain. The binary version of CI is similar to that of the proposed variation of CI, which is described in Section 3.2.
1. A self-adaptive scheme is introduced to update the sampling interval of the continuous parameters and the mutation rate of the binary parameters. The self-adaptive sampling interval is based on the candidate's behavior, with the aim of improving the exploitation process. More specifically, as the quality of the candidate's behavior increases, the sampling interval and the mutation rate decrease, and vice versa. The self-adaptive scheme is motivated by its reported benefits in other metaheuristics. For instance, Kivijärvi et al. (2003) found that introducing a self-adaptive scheme to GA led to comparable or better performance
The mutation rate of the binary parameters is based on the difference between the candidate's behavior and the best behavior of the entire cohort. Similarly, the sampling interval of the continuous parameter φ_p may shrink or expand based on the candidate's behavior. The self-adaptive scheme for the SVM model selection and feature selection problems is described as follows:

i. For SVM model selection: the sampling interval ψ_p^s ∈ [ψ_p^min,s, ψ_p^max,s] of the continuous parameter φ_p^s is updated at each learning attempt by updating its lower and upper bounds, ψ_p^min,s and ψ_p^max,s, respectively, as follows:

    ψ_p^min,s = max(ψ_p^min, φ_p^s − (1 − f(φ^s)) × ‖ψ_p‖/2)
    ψ_p^max,s = min(φ_p^s + (1 − f(φ^s)) × ‖ψ_p‖/2, ψ_p^max)              (6)

ii. For feature selection: the mutation rate of each candidate is updated at each learning attempt as follows:

    p_s^l = max(1/d, max(F(Φ^s)) − f(φ^s))                                (7)
where p_s^l is the probability that each bit in the binary part of candidate s will undergo mutation at learning attempt l, whereas F(Φ^s) is a vector of the cohort behaviors. That is, the rate at which the binary part of each candidate undergoes mutation is determined by the difference between its behavior and the best behavior in the cohort.
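The two self-adaptive updates can be sketched as follows (our illustrative names; f(φ) is assumed to lie in [0, 1], e.g., a classification accuracy):

```python
# Sketch (ours) of the self-adaptive updates (6) and (7): a better behavior
# yields a narrower sampling interval and a lower bit-flip probability.

def update_interval(phi_p, fit, psi_min, psi_max, width):
    """Eq. (6): shrink [psi_min_s, psi_max_s] around phi_p as fit -> 1."""
    lo = max(psi_min, phi_p - (1.0 - fit) * width / 2.0)
    hi = min(phi_p + (1.0 - fit) * width / 2.0, psi_max)
    return lo, hi

def mutation_rate(fit, best_fit, d):
    """Eq. (7): bit-flip probability grows with the gap to the best behavior,
    with a floor of 1/d (d = number of parameters)."""
    return max(1.0 / d, best_fit - fit)

# A good candidate (fit = 0.95) gets a tight interval and a low mutation rate.
print(update_interval(2.0, 0.95, 0.0, 10.0, 4.0))   # approximately (1.9, 2.1)
print(mutation_rate(0.95, 0.98, 20))                # max(1/20, 0.03) = 1/20
```

A poor candidate gets the opposite treatment: a wide interval and a high mutation rate, which pushes it to explore.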
2. Tournament-based mutation: the mutation operator was introduced to CI by Krishnasamy et al. (2014) to overcome premature convergence to a local minimum. In SACI, tournament selection is used to select S − 1 candidates to undergo mutation in the continuous domain. Tournament selection is a commonly used selection operator in GA (Wu et al., 2016). Using tournament selection (instead of roulette selection as in Krishnasamy et al. (2014)) is motivated by its capability to improve the diversity and convergence of DE, as found by Qiu et al. (2013). These advantages can be attributed to the low susceptibility of tournament selection to takeover by dominant individuals (Noraini & Geraghty, 2011). Also, the candidate with the best behavior at each learning attempt is passed on to the next learning attempt to include its neighborhood in the quality sampling step. That is, the S − 1 mutant candidates, in addition to the candidate with the best behavior in the previous learning attempt, will be used in the quality sampling step. The mutation in each domain is performed as follows:
i. For SVM model selection: tournament selection is performed by randomly selecting two candidates (i.e., a tournament size of two), and the candidate with the better behavior is selected for mutation. Then, the mutation of each parameter φ_p^s results in φ_p^s′, as given in (8):

    φ_p^s′ = φ_p^s + u · (ψ_p^max,s − φ_p^s) · (1 − l/L),  if u ≥ 0.5
    φ_p^s′ = φ_p^s − u · (φ_p^s − ψ_p^min,s) · (1 − l/L),  otherwise      (8)

where u ∈ [0, 1] is a random number, whereas l and L are the current and maximum numbers of learning attempts, respectively.
ii. For feature selection: to increase the diversity of the cohort, one-bit flip mutation is applied with a rate of 1/S, where S is the number of candidates.
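Both mutation operators can be sketched as follows; the function names and example values are ours:

```python
# Sketch (ours) of SACI's mutation step: Eq. (8) for a continuous parameter
# and one-bit flip for the binary feature mask.
import random

def mutate_continuous(phi_p, psi_min_s, psi_max_s, l, L, u=None):
    """Eq. (8): move phi_p toward one of its bounds, damped by (1 - l/L)."""
    u = random.random() if u is None else u
    if u >= 0.5:
        return phi_p + u * (psi_max_s - phi_p) * (1.0 - l / L)
    return phi_p - u * (phi_p - psi_min_s) * (1.0 - l / L)

def one_bit_flip(bits):
    """Flip a single randomly chosen bit of the binary feature mask."""
    i = random.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

# Early attempts (small l/L) allow large moves; late attempts barely move.
print(mutate_continuous(2.0, 0.0, 10.0, l=1, L=100, u=0.5))   # large step
print(mutate_continuous(2.0, 0.0, 10.0, l=99, L=100, u=0.5))  # near 2.0
print(one_bit_flip([1, 0, 1, 1]))
```

The (1 − l/L) factor makes the continuous mutation shrink over the learning attempts, mirroring the exploitation emphasis of the later attempts.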
other potential good search areas (Noraini & Geraghty, 2011). In contrast, tournament selection prevents the dominance of solutions with high fitness by giving those with lower fitness a chance to be selected.
Based on these modifications, the only control parameters that need to be set by the user in SACI are the number of candidates, S, and the number of quality variations, Q, whereas CI additionally requires the sampling interval reduction factor, r ∈ [0, 1]. The pseudocode of the SACI algorithm is given in Algorithm 2.
4. The Proposed SVM-SACI Approach

In this section, the proposed SVM-SACI approach for simultaneous feature selection and SVM model selection is presented. The structure of the solution, which represents the SVM hyperparameters C and γ (real-valued) and the selected feature subset (binary-valued), is shown in Figure 2.

The binary representation of the features implies that each feature is represented by 1 if selected and 0 otherwise. The candidate's behavior represents the SVM classification accuracy on the training set, f(φ^s), and the complexity of the selected features (i.e., the number of selected features, |S*|). The fitness function is given in (9).
Algorithm 2 SACI algorithm
Input:
S: no. of candidates
Q: no. of quality variations
L: max. no. of learning attempts
ε: convergence tolerance
τmax: max. no. of successive learning attempts for saturation
Procedure:
1: Set τ = 1
2: Generate random candidates Φ = {φ^1, · · · , φ^S}
3: Evaluate initial candidates F(Φ) = {f(φ^1), · · · , f(φ^S)}
4: for l = 1 : L do
5:   Keep the candidate with the best behavior and select S − 1 candidates using tournament selection
6:   Perform mutation in the continuous domain using (8) and one-bit flip mutation with a rate of 1/S in the binary domain
7:   Evaluate the mutant candidates and update the cohort with them
8:   for s = 1 : S do
9:     Select a candidate's behavior to follow using tournament selection
10:    Update the sampling interval in the continuous domain using (6) and the mutation rate in the binary domain using (7)
11:    Sample Q qualities (solutions) using the updated neighborhood ψ_p^s and
17:    τ = τ + 1
18:    ψ_p^s ∈ [ψ_p^min, ψ_p^max], s = 1, · · · , S
19:   end if
20:  end if
21: end if
22: if τ = τmax then
23:   break
24: end if
25: end for
Output: Best solution attained by the cohort
    Fitness = λ f(φ^s) + (1 − λ) (|J| − |S*|) / |J|                       (9)

where |J| and |S*| are the cardinalities of the original feature set index (the number of original features) and the selected feature subset index (the number of selected features), respectively, and λ ∈ [0, 1] is a trade-off factor. This weighted-sum fitness function penalizes solutions with a higher number of selected features (higher |S*|) to encourage the selection of a smaller feature subset. The trade-off factor can be set by the user based on the importance of each objective. In this research, λ was set to 0.8 in all numerical experiments, as in Huang & Wang (2006).
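As a worked example of Eq. (9) (with our illustrative numbers), a candidate that reaches the same accuracy with fewer selected features receives a higher fitness:

```python
# Sketch (ours) of the weighted-sum fitness in Eq. (9).
def fitness(accuracy, n_selected, n_total, lam=0.8):
    """lam * f(phi_s) + (1 - lam) * (|J| - |S*|) / |J|."""
    return lam * accuracy + (1.0 - lam) * (n_total - n_selected) / n_total

# Same accuracy, fewer features -> higher fitness (the |S*| penalty at work).
print(round(fitness(0.90, n_selected=5, n_total=20), 4))   # 0.8*0.9 + 0.2*(15/20)
print(round(fitness(0.90, n_selected=15, n_total=20), 4))  # 0.8*0.9 + 0.2*(5/20)
```

With λ = 0.8, accuracy dominates the score while the dimensionality term breaks ties in favor of smaller subsets.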
1. SACI requires fewer control parameters to be set a priori by the user compared to the other algorithms. The only control factors required to be set by the user in SACI are the number of candidates, the number of quality variations, and the mutation rate for quality sampling in the binary domain. In addition to these control factors, CI (with its binary version introduced in this paper) requires setting the sampling interval reduction factor r a priori, which is not required in SACI due to its self-adaptive scheme. As with the other well-known metaheuristics that have been used for feature selection and SVM model selection in the literature, SACI requires setting the number of candidates and the number of quality variations, analogous to the population size, swarm size, two population sizes, and colony size in GA, PSO, DE, and ABC, respectively. However, the other metaheuristics require setting more control parameters, namely the crossover rate in GA and DE, the learning factors and inertia weight in PSO, and the scout limit in ABC.
2. The binary coding representation of the continuous parameters, as in
SVM-GA (Huang & Wang, 2006), increases the computation cost and can influence the performance of the search strategy (Chen et al., 2015a), whereas in SACI (as in CI) real coding is used for the continuous parameters.
3. The greedy strategy used in ABC and DE is not applied in SACI, as it can lead to abandoning potentially good search areas and, as a consequence, reduce the effectiveness of exploring the search space (Chakhlevitch & Cowling, 2008). In ABC, greedy selection is used by the employed bees to decide whether to select the new food source (solution) during the neighborhood search or to keep the old food source (Liu et al., 2013): if the nectar amount (fitness) of the new food source is higher than that of the old food source, the employed bees select the new food source; otherwise, the old food source is kept. Similarly, in DE the offspring (new solution) replaces the parent (old solution) if it has a higher fitness value; otherwise, the parent survives and passes on to the next iteration (Vesterstrom & Thomsen, 2004). In SACI, S − 1 mutant candidates are included in the quality sampling step (neighborhood search) at each learning attempt even if their behaviors (fitness values) are not better than those of their parent candidates (i.e., non-greedy selection). To pass the best search area reached in the previous learning attempt on to the current one, the candidate with the best behavior from the previous learning attempt is also included in the quality sampling step. This non-greedy selection prevents abandoning the search areas of mutant candidates whose behaviors are not better than those of their parents. Therefore, SACI could be more effective than ABC and DE in exploring the search space.
4. The diversity of the solutions in SACI is higher than that of CI due to the mutation operator employed in SACI, which can prevent fast convergence to a local optimum.
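The contrast between the greedy replacement in ABC/DE and the non-greedy sampling in SACI can be sketched as follows (a minimal Python illustration, not the authors' R implementation; the candidate representation and the toy fitness function are assumptions):

```python
def greedy_replace(parent, mutant, fitness):
    # ABC/DE-style: keep the mutant only if it is strictly fitter;
    # otherwise the parent survives and the mutant's area is abandoned.
    return mutant if fitness(mutant) > fitness(parent) else parent

def non_greedy_pool(best_prev, mutants):
    # SACI-style: all S - 1 mutants enter the quality sampling step
    # regardless of fitness, plus the best candidate carried over from
    # the previous learning attempt.
    return [best_prev] + list(mutants)

# Toy fitness on scalar "candidates": closer to 0 is better.
fitness = lambda x: -abs(x)
survivor = greedy_replace(1.0, 5.0, fitness)   # worse mutant is discarded
pool = non_greedy_pool(0.2, [3.0, -4.0])       # worse mutants still explored
```

The non-greedy pool is what keeps potentially promising regions alive even when a mutant's current fitness is poor.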
The advantages of the proposed SVM-SACI originate from the search mechanism of SACI. However, SACI has several limitations. First, SACI requires more function evaluations than CI because it diversifies the solutions using mutation, which consequently requires additional computation time. Second, the stopping criterion in SACI (and also in CI) is more conservative than that of GA, PSO, DE, and ABC, which can cause SACI to run until the maximum number of learning attempts is reached in most cases. The convergence condition in GA, PSO, DE, and ABC is, basically, reached when the best solution attained throughout the search shows no improvement for a specific number of successive iterations. In SACI, by contrast, the convergence condition is reached when there is no change greater than the convergence tolerance, ε, in the minimum and maximum fitness values of the cohort, which is less likely to be reached than the convergence condition in GA, PSO, DE, and ABC. This is because the minimum fitness value can change significantly within a few iterations due to the diversity of SACI's solutions and its non-greedy search. Thus, the user should consider the impact of the convergence speed of SACI when setting the value of ε. Third, the diversity of the solutions in SACI is less than that of GA, since the candidates in SACI determine the diversity of the solutions through mutation and their number is typically smaller than the population size in GA when the same number of solutions is used in each iteration or learning attempt. For example, when the population size in GA is set to 100 and both the number of candidates and the number of quality variations in SACI are set to 10 (10 × 10 = 100), only 9 solutions in SACI undergo mutation during exploration at each learning attempt, which is fewer than in GA. Thus, the user should consider a high mutation rate in SACI that is inversely proportional to the number of candidates. However, a large population size in GA (i.e., larger diversity) does not necessarily improve the performance of GA; according to Alajmi & Wright (2014), a GA with a small population size can converge quickly to optimum solutions.
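The two kinds of stopping criteria contrasted above can be sketched as follows (an illustrative Python sketch under the assumption that fitness is maximized; the tolerance and patience values are arbitrary placeholders):

```python
def cohort_saturated(history, eps=1e-4):
    """SACI/CI-style criterion: stop only when neither the minimum nor the
    maximum fitness of the cohort changed by more than eps between the two
    most recent learning attempts."""
    if len(history) < 2:
        return False
    (prev_min, prev_max), (cur_min, cur_max) = history[-2], history[-1]
    return abs(cur_min - prev_min) <= eps and abs(cur_max - prev_max) <= eps

def no_improvement(best_history, patience=20):
    """GA/PSO/DE/ABC-style criterion: stop when the best fitness has not
    improved for `patience` successive iterations."""
    if len(best_history) <= patience:
        return False
    return max(best_history[-patience:]) <= best_history[-patience - 1]
```

Because `cohort_saturated` also tracks the minimum fitness, which mutation and non-greedy sampling keep perturbing, it fires far less often than `no_improvement`, matching the conservative behavior described above.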
This section presents the results of the numerical experiments using ten benchmark datasets from the literature, which are described in Section 5.1. The performance of the proposed SVM-SACI approach was compared with that of the SVM-CI approach in terms of the SVM classification accuracy on the testing (out-of-sample) set and the dimensionality reduction. In addition, the performance of the proposed approach was compared with four other hybrid approaches, namely SVM-GA, SVM-PSO, SVM-DE, and SVM-ABC, which are formed by integrating GA, PSO, DE, and ABC, respectively, with SVM. The parameter settings of all algorithms are given in Section 5.2, and the numerical results are discussed below.
The description of the datasets used in the numerical experiments is presented in Table 1. Every dataset was randomly split into 70% for training and 30% for testing. All datasets were scaled such that all variables have values within the range of −1 to 1. Scaling the variables prevents those with high variance from dominating those with low variance and decreases the computational complexity (Hsu et al., 2003). Moreover, the given datasets contain no missing values.
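The scaling described above can be sketched as a per-variable min–max transformation into [−1, 1] (an illustrative Python sketch; the paper does not spell out the exact formula, so this is the standard transformation consistent with the stated range):

```python
def scale_to_range(column):
    # Min-max scaling of one variable (column) into [-1, 1]:
    #   x' = 2 * (x - min) / (max - min) - 1
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: map to 0
        return [0.0 for _ in column]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in column]
```

Applied column by column, this maps each variable's smallest value to −1 and its largest to 1, so no variable dominates purely because of its scale.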
PT
Zoo 101 17 7
Land Cover 168 148 9
Musk 476 168 2
LSVT Voice 127 309 2
24
ACCEPTED MANUSCRIPT
T
Table 2: Parameter settings of CI and SACI.

Parameter                                              CI       SACI
No. of candidates                                      10       10
No. of quality variations                              10       10
Initial sampling intervals (C/γ)                       50/0.2   50/0.2
Mutation rate for behavior sampling in binary domain   0.1      0.1
Sampling interval reduction factor                     0.8      Self-adaptive

Given that the solution's structure includes real-valued and binary-valued parameters, as illustrated in Figure 2, real-coded and binary versions of all compared metaheuristics were used for the SVM hyperparameters and the features, respectively. The details of the parameter settings of the other algorithms are given as follows:
1. GA parameters: … crossover with a probability of 0.8 were used. In addition, the number of individuals that survive at each iteration (i.e., elitism) was set to 1, and the mutation rate was set to 0.01 (i.e., 1/population size).
2. DE parameters: … the binary version of DE proposed by Chen et al. (2015b) was used for the binary parameters (i.e., features).
3. PSO parameters: both learning factors were set to 1.8, whereas the inertia weight was set to 0.6. In SVM-PSO as proposed by Lin et al. (2008), the binary parameters (i.e., features) are represented by real values between 0 and 1, undergo the same PSO mechanism as in the continuous domain, and are simply rounded to 0 or 1 to determine whether a feature is selected or removed before evaluating the solution (SVM training). This mechanism in the binary domain does not imitate the mechanism of PSO in the continuous domain, which may influence the performance of PSO. Khanesar et al. (2007) proposed a binary version of PSO that better imitates its mechanism in the continuous domain and reported better performance of PSO in the binary domain. Therefore, the binary version of PSO proposed by Khanesar et al. (2007) was employed in this paper for feature selection.
4. ABC parameters: equal numbers of employed and onlooker bees were used, and the scout limit was set to 20. The binary-coded ABC proposed by Liu et al. (2013) was used for feature selection.
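The rounding mechanism of Lin et al. (2008) criticized in item 3 can be sketched as follows (an illustrative Python sketch; the layout of two real-valued hyperparameters followed by one value per feature follows Figure 2 as described in the text, but the function and variable names are assumptions):

```python
def decode_solution(position):
    # First two dimensions encode the SVM hyperparameters C and gamma;
    # the remaining dimensions are real values in [0, 1] that are simply
    # rounded to decide whether each feature is selected (1) or removed (0).
    C, gamma = position[0], position[1]
    mask = [1 if v >= 0.5 else 0 for v in position[2:]]
    return C, gamma, mask
```

Because the feature dimensions evolve as ordinary real values and only the final rounding makes them binary, this decoding does not reproduce a genuinely binary search dynamic, which is the shortcoming the binary PSO of Khanesar et al. (2007) addresses.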
5.3. Model Selection

In this paper, 3 × 5-fold CV was used during the SVM training process, i.e., 5-fold CV repeated 3 times. Krstajic et al. (2014) recently discussed the pitfalls of k-fold Cross-Validation (CV) in model selection and found that using repeated k-fold CV results in more robust models. Therefore, n × k-fold CV was used for SVM model selection in all numerical experiments in this paper. In k-fold CV, the training set is split into k subsets, where k − 1 subsets are used for training and the remaining subset is used as a validation subset. This procedure is repeated k times by swapping the training and validation subsets until every subset has been used as a validation subset once (Hastie et al., 2009). After the best SVM model and the best feature subset are selected by the search strategy, the selected SVM model is evaluated on the testing set with the selected feature subset. The evaluation of a solution's quality is based on the classification accuracy on the testing set and the dimensionality reduction. The SVM classification accuracy on the testing set is calculated by dividing the number of correctly classified testing instances by the total number of testing instances.
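The n × k-fold procedure described above can be sketched as follows (an illustrative Python sketch of the index generation only; the SVM training on each split is omitted, and the seed is an arbitrary assumption):

```python
import random

def repeated_kfold_indices(n_samples, k=5, n_repeats=3, seed=0):
    """Yield (train_idx, val_idx) pairs for n_repeats independent
    shufflings of k-fold CV, so each sample is validated once per repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            val = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, val
```

With k = 5 and n_repeats = 3 this yields 15 train/validation splits per candidate solution, matching the 3 × 5-fold CV used in the paper.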
The numerical experiments were conducted on Windows using an Intel Xeon X5670 CPU @ 2.96 GHz and 48 GB of RAM. All approaches were implemented in the R statistical programming language. The mean, standard deviation, minimum, and maximum values of the SVM classification accuracy on the testing set, the number of selected features, and the CPU time in seconds for all approaches, based on 20 independent runs, are given in Table 3 and Table 4. As can be seen in Table 3 and Table 4, the proposed SVM-SACI approach outperformed the other approaches in terms of both the average accuracy on the testing set and the average dimensionality reduction on five datasets out of ten, namely, Australian credit, Breast Cancer, Parkinson, Statlog, and LSVT Voice. On the other hand, none of the other approaches achieved both the highest accuracy and the highest dimensionality reduction on a single dataset. In other words, the proposed approach outperformed the compared approaches for simultaneous feature selection and SVM model selection on 50% of the datasets used in this paper.
In addition, the proposed approach achieved both the highest average accuracy and the lowest average number of selected features on the Australian dataset with zero S.D., which indicates its robustness in finding stable results over the 20 runs. The solution given by the proposed approach for the Australian credit scoring dataset is very simple (only one feature is required) and effective, since it achieved high accuracy (85.10%) compared to the other approaches. According to Maldonado et al. (2017), prediction models that use ≤ 10 features are of prime interest to companies that employ risk models for credit scoring. Based on this finding, the proposed approach has good potential for such applications.

Although SVM-DE achieved the highest average accuracy (70.51%) on the Pima dataset using 1–3 features, as shown in Table 3, the proposed approach achieved a very close average accuracy (69.85%) using only one selected feature throughout the 20 runs (zero S.D.). Also, SVM-SACI achieved the highest average accuracy (91.83%) on the Musk dataset with 43.17 more selected features (on average) than SVM-DE; however, the average accuracy given by the latter is 10% less than that given by the proposed approach. That is, the smallest selected feature subset, achieved by SVM-DE on the Musk dataset, came at a considerable expense in accuracy. Furthermore, SVM-SACI found the smallest feature subsets on three datasets (30% of the datasets), namely Pima, Wine, and Zoo, with average accuracies very close to the highest average accuracy
achieved by the other approaches (the largest difference in average accuracy is < 4%). On the Land Cover dataset, SVM-CI achieved the smallest average selected feature subset, with an average accuracy of 82.18%, which is less than the highest average accuracy (84.35%), achieved by SVM-GA. However, the average selected feature subset given by SVM-GA on this dataset is larger than that of SVM-CI by 24.75%, which might not justify the 2.17% increase in average accuracy given by SVM-GA. Furthermore, Figures 4−13 show that SVM-SACI outperformed SVM-CI in terms of average accuracy and dimensionality reduction on 90% of the datasets used in this paper, which demonstrates the effectiveness of the modifications presented in Section 3.2. In terms of CPU time, SVM-SACI required more CPU time to achieve the highest average accuracy and dimensionality reduction on 50% of the datasets. However, the extra CPU time can be justified by the quality of the solutions attained by SVM-SACI. For example, SVM-SACI required 43.7% more average CPU time than SVM-DE to achieve the highest average accuracy (83.29%) on the LSVT Voice dataset; however, the average accuracy achieved by SVM-DE was 65.80%, i.e., 20% less than that achieved by SVM-SACI, a trade-off that could be justifiable in many applications. Furthermore, on the same dataset, LSVT Voice, SVM-PSO achieved the same highest average accuracy as SVM-SACI but required 10% more average time than the proposed approach. In addition, other approaches, such as SVM-ABC on the Wine dataset, required more time than SVM-SACI to achieve the highest average accuracy and dimensionality reduction. However, as discussed earlier, the conservative stopping criterion of SACI (and CI) causes them to run until the maximum number of learning attempts is reached, even if there is no improvement in the best solution for a considerable number of successive learning attempts. The average accuracy and number of selected features achieved by all approaches on the datasets used in this paper are depicted in Figures 4−13.
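The dimensionality-reduction percentages plotted in Figures 4−13 follow from the selected-feature counts in Table 3 and Table 4 as the fraction of original features removed (the formula is implied by the reported numbers rather than stated explicitly; the example below uses the Musk row of Table 4 with its 168 original features):

```python
def dimensionality_reduction(selected, total):
    # Percentage of the original features removed from the model.
    return 100.0 * (1.0 - selected / total)

# SVM-SACI selected 73.67 features on average out of Musk's 168,
# i.e., a dimensionality reduction of about 56.15%.
musk_saci = dimensionality_reduction(73.67, 168)
```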
Table 3: Mean, standard deviation (S.D.), minimum and maximum values of the number of selected features, and SVM classification accuracy on the testing sets based on 20 independent runs. For each dataset, columns are CPU Time (Sec.) / No. of Features / Accuracy (%). Datasets, left to right: Australian; Breast Cancer Wisconsin (Diagnostic); Parkinson; Pima (Indians Diabetes); Heart Disease (Statlog).

SVM-CI    Mean    642.27 /  5.05 / 83.47 |  498.21 / 12.45 / 85.97 |  472.83 /  8.25 / 85.08 | 1306.28 / 1.75 / 64.98 |  503.29 / 3.85 / 71.42
          S.D.    156.38 /  1.05 /  2.24 |   26.22 /  2.78 /  7.59 |   15.61 /  2.57 /  4.66 |  587.64 / 0.85 /  3.95 |    9.67 / 1.31 /  5.49
          Min.    211.99 /  3.00 / 76.90 |  456.04 /  8.00 / 70.20 |  447.80 /  3.00 / 78.00 |  701.84 / 1.00 / 57.60 |  483.81 / 2.00 / 60.50
          Max.   1149.54 /  7.00 / 86.10 |  650.72 / 18.00 / 95.90 |  504.38 / 13.00 / 93.20 | 2952.91 / 4.00 / 72.30 |  518.20 / 6.00 / 80.20
SVM-ABC   Mean    826.85 /  1.45 / 85.10 |  512.60 /  6.40 / 97.90 |  438.60 /  4.00 / 86.76 | 1116.39 / 1.00 / 70.42 |  469.23 / 3.60 / 81.24
          S.D.    320.81 /  0.51 /  0.16 |   42.38 /  1.34 /  0.96 |    3.84 /  0.71 /  3.24 |  286.79 / 0.00 /  0.40 |    6.94 / 0.55 /  2.57
          Min.    592.31 /  1.00 / 84.60 |  469.27 /  5.00 / 96.50 |  432.89 /  3.00 / 83.10 |  815.59 / 1.00 / 69.70 |  460.11 / 3.00 / 79.00
          Max.   1600.17 /  2.00 / 85.60 |  583.12 /  8.00 / 98.80 |  443.43 /  5.00 / 91.50 | 1489.12 / 1.00 / 70.60 |  478.81 / 4.00 / 84.00
SVM-GA    Mean    618.51 /  2.65 / 84.96 |  486.74 / 10.05 / 98.49 |  349.46 /  7.10 / 86.50 | 1378.63 / 1.20 / 70.22 |  338.42 / 3.80 / 80.31
          S.D.    139.90 /  0.99 /  0.65 |   47.62 /  1.67 /  0.69 |   16.86 /  1.55 /  3.53 |  116.71 / 0.52 /  1.06 |   40.10 / 1.61 /  3.74
          Min.    267.35 /  1.00 / 82.70 |  329.75 /  7.00 / 97.10 |  330.16 /  4.00 / 79.70 | 1192.87 / 1.00 / 68.80 |   88.50 / 1.00 / 72.80
          Max.   1219.29 /  4.00 / 85.60 |  493.56 / 13.00 / 100.00 | 482.91 / 10.00 / 96.60 | 1615.19 / 3.00 / 73.20 |  219.40 / 6.00 / 84.00
SVM-DE    Mean    878.93 /  1.20 / 85.10 |  302.78 /  5.20 / 98.34 |  304.63 /  5.30 / 86.52 | 1381.97 / 1.65 / 70.51 |  327.07 / 5.00 / 81.16
          S.D.    315.09 /  0.45 /  0.00 |   29.81 /  1.48 /  0.85 |   70.59 /  1.78 /  3.96 |   42.64 / 0.70 /  0.86 |   56.18 / 0.79 /  2.34
          Min.    678.67 /  1.00 / 85.10 |  251.13 /  3.00 / 97.10 |  238.36 /  2.00 / 79.70 | 1334.31 / 1.00 / 69.30 |  255.56 / 3.00 / 75.30
          Max.   1433.65 /  2.00 / 85.10 |  324.63 /  7.00 / 99.40 |  421.09 /  8.00 / 94.90 | 1478.81 / 3.00 / 71.90 |  438.66 / 6.00 / 85.20
SVM-PSO   Mean    548.55 /  1.20 / 85.10 |  455.06 /  9.70 / 98.31 |  220.31 /  7.10 / 88.71 | 1343.39 / 1.00 / 70.46 |  376.14 / 3.95 / 82.71
          S.D.    253.51 /  0.41 /  0.00 |   75.67 /  2.32 /  0.92 |   79.00 /  1.55 /  4.73 |  241.59 / 0.00 /  0.81 |   84.91 / 1.05 /  0.41
          Min.    220.52 /  1.00 / 85.10 |  272.91 /  5.00 / 95.90 |  142.10 /  4.00 / 79.70 | 1221.07 / 1.00 / 69.70 |  142.93 / 3.00 / 81.50
          Max.   1196.29 /  2.00 / 85.10 |  613.73 / 14.00 / 99.40 |  377.25 / 10.00 / 96.60 | 1535.52 / 1.00 / 72.30 |  471.33 / 7.00 / 84.00
SVM-SACI  Mean    679.92 /  1.00 / 85.10 |  398.03 /  3.40 / 98.61 |  326.13 /  4.60 / 89.14 | 1356.63 / 1.00 / 69.85 |  434.34 / 3.00 / 82.71
          S.D.    192.81 /  0.00 /  0.00 |    2.96 /  0.50 /  0.56 |    5.08 /  1.23 /  6.16 |  178.45 / 0.00 /  0.25 |    4.72 / 0.00 /  0.00
          Min.    265.41 /  1.00 / 85.10 |  393.71 /  3.00 / 97.10 |  315.26 /  3.00 / 78.00 | 1176.15 / 1.00 / 69.70 |  430.83 / 3.00 / 82.70
          Max.   1310.57 /  1.00 / 85.10 |  404.70 /  4.00 / 99.40 |  336.60 /  6.00 / 94.90 | 2349.06 / 1.00 / 70.60 |  441.24 / 3.00 / 82.70

Note: best performance for each dataset is marked with boldface.
Table 4: Mean, standard deviation (S.D.), minimum and maximum values of the number of selected features, and SVM classification accuracy on the testing sets based on 20 independent runs. For each dataset, columns are CPU Time (Sec.) / No. of Features / Accuracy (%). Datasets, left to right: Wine; Zoo; Musk; LSVT Voice; Land Cover.

SVM-CI    Mean    481.82 / 5.35 /  83.32 |  658.29 / 5.80 / 74.35 |  475.11 /  80.20 / 81.03 |  410.49 / 145.55 / 73.56 | 1323.00 /  42.80 / 82.18
          S.D.      4.77 / 1.57 /  10.46 |    6.92 / 1.28 /  8.71 |    7.98 /   8.75 / 14.24 |    3.37 /   9.13 /  9.09 |   51.34 /   1.30 /  1.11
          Min.    474.24 / 3.00 /  51.90 |  644.75 / 3.00 / 58.10 |  465.17 /  68.00 / 58.00 |  405.47 / 133.00 / 65.80 | 1267.89 /  41.00 / 80.80
          Max.    490.75 / 8.00 / 100.00 |  669.84 / 8.00 / 93.50 |  496.69 /  97.00 / 95.80 |  417.65 / 165.00 / 89.50 | 1397.94 /  44.00 / 83.70
SVM-ABC   Mean    445.17 / 4.20 /  96.64 |  586.60 / 5.80 / 83.22 |  596.88 /  70.25 / 88.91 |  518.55 / 141.05 / 82.76 | 1823.26 /  57.47 / 83.09
          S.D.      5.40 / 0.84 /   3.57 |    3.52 / 1.79 /  8.64 |   14.25 /   8.12 /  4.41 |    7.40 /   9.42 /  5.95 |  193.53 /   4.07 /  1.45
          Min.    439.22 / 3.00 /  90.70 |  580.73 / 3.00 / 71.00 |  565.03 /  55.00 / 76.20 |  495.12 / 112.00 / 65.80 | 1456.60 /  52.00 / 81.30
          Max.    450.75 / 5.00 / 100.00 |  589.41 / 8.00 / 90.30 |  642.34 /  82.00 / 95.10 |  534.29 / 156.00 / 89.50 | 2128.14 /  64.00 / 85.70
SVM-GA    Mean    431.68 / 4.15 /  93.04 |  182.98 / 5.35 / 81.77 |  572.93 /  78.30 / 90.38 |  113.61 / 150.20 / 82.76 | 1284.03 /  67.55 / 84.35
          S.D.     31.24 / 0.67 /   4.41 |   37.42 / 1.14 /  7.27 |   68.67 /   5.13 /  1.93 |   27.17 /   6.44 /  4.04 |  178.79 /   4.15 /  1.88
          Min.    398.69 / 2.00 /  85.20 |  145.76 / 3.00 / 67.70 |  541.59 /  71.00 / 87.40 |   76.68 / 136.00 / 73.70 |  915.17 /  59.00 / 80.80
          Max.    506.39 / 5.00 / 100.00 |  237.39 / 8.00 / 93.50 |  852.56 /  90.00 / 93.70 |  168.63 / 162.00 / 89.50 | 1681.16 /  75.00 / 88.70
SVM-DE    Mean    285.43 / 4.87 /  92.82 |  491.22 / 6.05 / 81.93 |  449.67 /  30.50 / 81.77 |  325.54 /  89.20 / 65.80 | 1450.38 /   8.67 / 74.40
          S.D.     45.71 / 0.74 /   4.19 |  107.02 / 0.94 /  6.06 |   97.62 /  27.98 /  6.58 |   24.98 /   7.05 /  0.00 |  563.02 /   3.51 /  9.16
          Min.    248.49 / 4.00 /  83.30 |  358.88 / 4.00 / 67.70 |  318.74 /   5.00 / 73.40 |  289.02 /  82.00 / 65.80 | 1026.85 /   5.00 / 65.00
          Max.    393.29 / 6.00 /  98.10 |  621.67 / 8.00 / 90.30 |  562.28 /  84.00 / 92.30 |  357.99 / 101.00 / 65.80 | 2089.29 /  12.00 / 83.30
SVM-PSO   Mean    263.35 / 4.30 /  94.52 |  529.94 / 5.75 / 85.64 |  541.56 /  88.60 / 91.08 |  514.49 / 202.30 / 83.29 | 1252.01 /  85.65 / 83.45
          S.D.     67.35 / 0.86 /   3.48 |    3.07 / 0.91 /  5.38 |  105.88 /  12.26 /  3.97 |  118.71 /  22.03 /  4.04 |  256.77 /   9.72 /  1.99
          Min.    145.45 / 3.00 /  88.90 |  527.44 / 4.00 / 74.20 |  448.35 /  78.00 / 80.40 |  405.90 / 169.00 / 76.30 | 1055.05 /  60.00 / 80.30
          Max.    396.38 / 6.00 / 100.00 |  534.63 / 8.00 / 93.50 |  818.02 / 117.00 / 95.80 |  729.88 / 244.00 / 89.50 | 2128.71 / 100.00 / 87.20
SVM-SACI  Mean    417.73 / 3.00 /  93.32 |  347.22 / 4.40 / 84.50 |  617.09 /  73.67 / 91.83 |  467.83 / 135.30 / 83.29 | 1647.15 /  54.00 / 83.36
          S.D.      4.05 / 1.00 /   2.83 |   87.08 / 0.82 /  4.87 |    9.06 /   5.59 /  3.80 |    7.44 /   6.39 /  3.11 |  170.64 /   5.73 /  0.85
          Min.    413.64 / 2.00 /  90.70 |  198.34 / 3.00 / 77.40 |  606.18 /  64.00 / 85.30 |  439.66 / 124.00 / 78.90 | 1461.42 /  46.00 / 82.30
          Max.    422.18 / 4.00 /  96.30 |  474.55 / 5.00 / 90.30 |  631.06 /  80.00 / 96.50 |  475.62 / 150.00 / 89.50 | 2075.39 /  63.00 / 84.70

Note: best performance for each dataset is marked with boldface.
[Figure 4: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Australian dataset (the higher the better).]

[Figure 5: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Breast Cancer Wisconsin (Diagnostic) dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 85.97/58.50, SVM-ABC 97.90/78.67, SVM-GA 98.49/66.50, SVM-DE 98.34/82.67, SVM-PSO 98.31/67.67, SVM-SACI 98.61/88.67.]
In this paper, a new hybrid approach, SVM-SACI, for simultaneous feature selection and SVM model selection is presented based on integrating SVM with SACI. The new CI variation, SACI, is proposed to overcome CI's limitations.
[Figure 6: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Parkinson dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 85.08/64.13, SVM-ABC 86.76/82.61, SVM-GA 86.50/69.13, SVM-DE 86.52/76.96, SVM-PSO 88.71/69.13, SVM-SACI 89.14/80.00.]

[Figure 7: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Pima (Indians Diabetes) dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 64.98/78.13, SVM-ABC 70.42/87.50, SVM-GA 70.22/85.00, SVM-DE 70.51/79.38, SVM-PSO 70.46/87.50, SVM-SACI 69.85/87.50.]
Owing to the self-adaptive scheme, the number of SACI's control parameters that must be set a priori is smaller than that of CI, as well as of ABC, GA, DE, and PSO.

The results show that SVM-SACI outperformed the other compared approaches in terms of both the average accuracy on the testing set and the average dimensionality reduction on 50% of the datasets, whereas none of the other approaches achieved both the highest average accuracy and the highest average dimensionality reduction on a single dataset. In addition, the proposed SVM-SACI approach achieved more stable results, as indicated by the low standard deviations of the dimensionality reduction and accuracy of its solutions over the 20 runs.
[Figure 8: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Heart Disease (Statlog) dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 71.42/70.38, SVM-ABC 81.24/72.31, SVM-GA 80.31/70.77, SVM-DE 81.16/61.54, SVM-PSO 82.71/69.62, SVM-SACI 82.71/76.92.]

[Figure 9: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Wine dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 83.32/58.85, SVM-ABC 96.64/67.69, SVM-GA 93.04/68.08, SVM-DE 92.04/62.54, SVM-PSO 94.52/66.92, SVM-SACI 93.32/76.92.]
Furthermore, SVM-SACI outperformed SVM-CI in terms of average accuracy and dimensionality reduction, which demonstrates the advantages of the proposed modifications. Thus, based on these results, it can be concluded that the proposed SVM-SACI approach has strong potential for simultaneous feature selection and SVM model selection.
[Figure 10: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Zoo dataset (the higher the better). Accuracies (%): SVM-CI 74.35, SVM-ABC 83.22, SVM-GA 81.77, SVM-DE 81.93, SVM-PSO 85.64, SVM-SACI 84.50.]

[Figure 11: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Musk dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 81.03/52.26, SVM-ABC 88.91/58.18, SVM-GA 90.38/53.39, SVM-DE 81.77/81.85, SVM-PSO 91.08/47.26, SVM-SACI 91.83/56.15.]
These limitations could be addressed in future work to improve SACI's performance in the binary domain. Furthermore, the application of SACI could be extended to other common problems, such as clustering. For example, the sensitivity of the Overlapping k-means (OKM) clustering algorithm to its initial centroids (Khanmohammadi et al., 2017) could be addressed by integrating SACI with OKM for application to datasets with overlapping information. Finally, although the experimental results indicate the effectiveness of the proposed approach, search strategies based on metaheuristics are generally limited to relatively small datasets due to the high computation time required for high-dimensional datasets and big data. This limitation can
[Figure 12: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for LSVT Voice dataset (the higher the better). Accuracy / dimensionality reduction (%): SVM-CI 73.56/52.90, SVM-ABC 82.76/54.35, SVM-GA 82.76/51.39, SVM-DE 65.80/71.13, SVM-PSO 83.29/34.53, SVM-SACI 83.29/56.21.]

[Figure 13: Average SVM classification accuracies on testing set and percentage of dimensionality reduction achieved by all compared approaches for Land Cover dataset (the higher the better). Accuracies (%): SVM-CI 82.18, SVM-ABC 83.09, SVM-GA 84.35, SVM-DE 74.40, SVM-PSO 83.45, SVM-SACI 83.36.]
be addressed in future work by employing more efficient strategies for feature selection and SVM model selection for high-dimensional datasets. For instance, Kalman Filter (KF) based methods can be more efficient than non-numerical optimization methods (metaheuristics) for high-dimensional datasets, according to the experimental results reported in Ji et al. (2017).
References

Alajmi, A., & Wright, J. (2014). Selecting the most efficient genetic algorithm sets in solving unconstrained building optimization problem. International Journal of Sustainable Built Environment, 3, 18–26.

Alham, N. K., Li, M., Liu, Y., & Hammoud, S. (2011). A MapReduce-based distributed SVM algorithm for automatic image annotation. Computers & Mathematics with Applications, 62, 2801–2811.

Alshamlan, H. M., Badr, G. H., & Alohali, Y. A. (2015). Genetic Bee Colony (GBC) algorithm: A new gene selection method for microarray cancer classification.

Avci, E. (2009). Selecting of the optimal feature subset and kernel parameters in digital modulation classification by using hybrid genetic algorithm–support vector machines: HGASVM. Expert Systems with Applications, 36, 1391–1402.

Ben-Tal, A., Bhadra, S., Bhattacharyya, C., & Nath, J. S. (2011). Chance constrained uncertain classification via robust optimization. Mathematical Programming, 127, 145–173.

Bennasar, M., Hicks, Y., & Setchi, R. (2015). Feature selection using joint mutual information maximisation. Expert Systems with Applications, 42, 8520–8532.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Chen, B., Liu, H., Chai, J., & Bao, Z. (2009). Large margin feature weighting method via linear programming. IEEE Transactions on Knowledge and Data Engineering, 21, 1475–1488.

Chen, J., Takiguchi, T., & Ariki, Y. (2015a). A robust SVM classification framework using PSM for multi-class recognition. EURASIP Journal on Image and Video Processing, 2015, 1–12.

Chen, Y., Xie, W., & Zou, X. (2015b). A binary differential evolution algorithm learning from explored solutions. Neurocomputing, 149, 1038–1047.

Chou, J.-S., Cheng, M.-Y., Wu, Y.-W., & Pham, A.-D. (2014). Optimizing parameters of support vector machine using fast messy genetic algorithm for dispute classification. Expert Systems with Applications, 41, 3955–3964.

Claesen, M., De Smet, F., Suykens, J. A., & De Moor, B. (2015). A robust ensemble approach to learn from positive and unlabeled data using SVM base models. Neurocomputing, 160, 73–84.

Couellan, N., Jan, S., Jorquera, T., & Georgé, J.-P. (2015). Self-adaptive support vector machine: A multi-agent optimization perspective. Expert Systems with Applications.

Czarnecki, W. M., Podlewska, S., & Bojarski, A. J. (2015). Robust optimization …

Devos, O., Downey, G., & Duponchel, L. (2014). Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils. Food Chemistry, 148, 124–130.

Eiben, Á. E., & Smit, S. K. (2011). Evolutionary algorithm parameters and methods to tune them. In Autonomous Search (pp. 15–36). Springer.

Emary, E., Zawbaa, H. M., & Hassanien, A. E. (2016). Binary grey wolf optimization approaches for feature selection. Neurocomputing, 172, 371–381.

Frohlich, H., Chapelle, O., & Scholkopf, B. (2003). Feature selection for support vector machines by means of genetic algorithm. In Tools with Artificial Intelligence (pp. 142–148). IEEE.

Gao, X., & Hou, J. (2016). An improved SVM integrated GS-PCA fault diagnosis approach of Tennessee Eastman process. Neurocomputing, 174, 906–911.

Güraksın, G. E., Haklı, H., & Uğuz, H. (2014). Support vector machines classification …

Houari, R., Bounceur, A., Kechadi, M.-T., Tari, A.-K., & Euler, R. (2016). …

Hsu, C.-W., Chang, C.-C., Lin, C.-J. et al. (2003). A practical guide to support vector classification.

Huang, C.-L., & Wang, C.-J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications.

Inbarani, H. H., Bagyamathi, M., & Azar, A. T. (2015). A novel hybrid feature selection method based on rough set and improved harmony search. Neural Computing and Applications, 26, 1859–1880.

Ji, Y., Chen, Y., Fu, H., & Yang, G. (2017). An EnKF-based scheme to optimize hyper-parameters and features for SVM classifier. Pattern Recognition, 62, 202–213.
Jiang, J., Jiang, T., & Zhai, S. (2014). A novel recognition system for human activity based on wavelet packet and support vector machine optimized by improved adaptive genetic algorithm. Physical Communication, 13, 211–220.

Kanamori, T., Takeda, A., & Suzuki, T. (2013). Conjugate relation between loss … Journal of Machine Learning Research, 14, 1461–1504.

Katrutsa, A., & Strijov, V. (2017). Comprehensive study of feature selection methods to solve multicollinearity problem according to evaluation criteria. Expert Systems with Applications, 76, 1–11.

Khanesar, M. A., Teshnehlab, M., & Shoorehdeli, M. A. (2007). A novel binary particle swarm optimization. In Control & Automation, 2007. MED'07. Mediterranean Conference on (pp. 1–6). IEEE.

Khokhar, S., Zin, A. A. M., Memon, A. P., & Mokhtar, A. S. (2017). A new optimal feature selection algorithm for classification of power quality disturbances using discrete wavelet transform and probabilistic neural network. Measurement, 95, 246–259.

Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In AAAI (pp. 129–134). Volume 2.

Kivijärvi, J., Fränti, P., & Nevalainen, O. (2003). Self-adaptive genetic algorithm for clustering. Journal of Heuristics, 9, 113–129.

Krishnasamy, G., Kulkarni, A. J., & Paramesran, R. (2014). A hybrid approach for data clustering based on modified cohort intelligence and K-means. Expert Systems with Applications, 41, 6009–6016.

Krstajic, D., Buturovic, L. J., Leahy, D. E., & Thomas, S. (2014). Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of Cheminformatics, 6, 10.

Kulkarni, A. J., Baki, M. F., & Chaouch, B. A. (2016). Application of the cohort-intelligence optimization method to … problems. European Journal of Operational Research, 250, 427–447.

Kulkarni, A. J., Durugkar, I. P., & Kumar, M. (2013). Cohort intelligence: A self supervised learning behavior. In Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on (pp. 1396–1400). IEEE.

Kulkarni, A. J., & Shabir, H. (2016). Solving 0–1 knapsack problem using cohort intelligence algorithm. International Journal of Machine Learning and Cybernetics, 7, 427–441.

Li, L., Jiang, W., Li, X., Moser, K. L., Guo, Z., Du, L., Wang, Q., Topol, E. J., Wang, Q., & Rao, S. (2005). A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset. Genomics, 85, 16–23.

Lin, K.-C., Huang, Y.-H., Hung, J. C., & Lin, Y.-T. (2015). Feature selection and parameter optimization of support vector machines based on modified cat swarm optimization. International Journal of Distributed Sensor Networks, 11, 365869.

Lin, S.-W., Ying, K.-C., Chen, S.-C., & Lee, Z.-J. (2008). Particle swarm …

Liu, T., Zhang, L., & Zhang, J. (2013). Study of binary artificial bee colony algorithm based on particle swarm optimization. Journal of Computational Information Systems, 9, 6459–6466.
López, J., & Maldonado, S. (2017). Group-penalized feature selection and robust twin SVM classification via second-order cone programming. Neurocomputing.
Maldonado, S., Pérez, J., & Bravo, C. (2017). Cost-based feature selection for support vector machines: An application in credit scoring. European Journal of Operational Research, 261, 656–665.
Martins, M., Costa, L., Frizera, A., Ceres, R., & Santos, C. (2014). Hybridization between multi-objective genetic algorithm and support vector machine for feature selection in walker-assisted gait. Computer Methods and Programs in Biomedicine, 113, 736–748.
Milgram, J., Cheriet, M., & Sabourin, R. (2006). "One against one" or "one against all": Which one is better for handwriting recognition with SVMs? In Tenth International Workshop on Frontiers in Handwriting Recognition.
Moradi, P., & Rostami, M. (2015). A graph theoretic approach for unsupervised feature selection. Engineering Applications of Artificial Intelligence, 44, 33–45.
Noraini, M. R., & Geraghty, J. (2011). Genetic algorithm performance with different selection strategies in solving TSP.
Qiu, C., Liu, M., & Gong, W. (2013). Differential evolution with tournament-based mutation operators. International Journal of Computer Science Issues, 10, 180–187.
Samma, H., Lim, C. P., Saleh, J. M., & Suandi, S. A. (2016). A memetic-based fuzzy support vector machine model and its application to license plate recognition. Memetic Computing, 8, 235–251.
Sarafrazi, S., & Nezamabadi-pour, H. (2013). Facing the classification of binary problems with a GSA-SVM hybrid system. Mathematical and Computer Modelling, 57, 270–278.
Vesterstrom, J., & Thomsen, R. (2004). A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Evolutionary Computation, 2004. CEC2004. Congress on (pp. 1980–1987). IEEE, volume 2.
Wang, X., Fan, N., & Pardalos, P. M. (2015). Robust chance-constrained support vector machines with second-order moment information. Annals of Operations Research, (pp. 1–24).
Wu, Y.-Q., Han, F., & Ling, Q.-H. (2016). An improved ensemble extreme learning machine
Xu, H., & Mannor, S. (2012). Robustness and generalization. Machine Learning, 86, 391–423.

Xue, Z., Ming, D., Song, W., Wan, B., & Jin, S. (2010). Infrared gait recognition based on wavelet transform and support vector machine. Pattern Recognition, 43, 2904–2910.
Zhang, X., Qiu, D., & Chen, F. (2015). Support vector machine with parameter optimization by a novel hybrid method and its application to fault diagnosis.
Zhu, X., Huang, Z., Yang, Y., Shen, H. T., Xu, C., & Luo, J. (2013). Self-taught dimensionality reduction on the high-dimensional small-sized data. Pattern Recognition, 46, 215–229.
Zorarpacı, E., & Özel, S. A. (2016). A hybrid approach of differential evolution and artificial bee colony for feature selection. Expert Systems with Applications, 62, 91–103.