
Intelligent Data Analysis 17 (2013) 803–835

DOI 10.3233/IDA-130608
IOS Press

A novel feature subset selection algorithm based on association rule mining
Guangtao Wang and Qinbao Song∗
Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China

Abstract. In this paper, a novel feature selection algorithm FEAST is proposed based on association rule mining. The proposed
algorithm first mines association rules from a data set; then, it identifies the relevant and interactive feature values with the
constraint association rules whose consequent is the target concept, and detects and eliminates the redundant feature values with
the constraint association rules whose consequent and antecedent are both single feature values. Finally, it obtains the feature
subset by mapping the feature values to the corresponding features. As the support and confidence thresholds are two important
parameters in association rule mining and play a vital role in FEAST, a partial least square regression (PLSR) based threshold
prediction method is presented as well. The effectiveness of FEAST is tested on both synthetic and real world data sets, and the
classification results of five different types of classifiers with seven representative feature selection algorithms are compared.
The results on the synthetic data sets show that FEAST can effectively identify irrelevant and redundant features while preserving
interactive ones. The results on the real world data sets show that FEAST outperforms other feature selection algorithms in terms
of classification accuracies. In addition, the PLSR based threshold prediction method is performed on the real world data sets,
and the results show it works well in recommending proper support and confidence thresholds for FEAST.

Keywords: Feature subset selection, association rule, threshold prediction, statistical metrics of data sets

1. Introduction

Feature subset selection, which is a technique of identifying the most salient features for learning, is an
important research topic in machine learning and data mining. It can not only reduce the dimensionality
of the data, but also improve a learner in terms of learning performance, generalization capacity and
model simplicity.
Generally, feature subset selection aims at removing as many irrelevant and redundant features as
possible [1]. This is due to the fact that irrelevant features do not contribute to the predictive accuracy [2],
and redundant features do not contribute to building a better predictor because most of the information they
carry is already present in other feature(s) [3]. Thus, a number of feature subset selection algorithms
have been proposed to handle irrelevant and/or redundant features. Of these algorithms,
some can effectively remove irrelevant features but fail to identify redundant ones [4–6], while
others can remove the irrelevant features while also taking the redundant ones into account [7–10].
However, feature interaction is a non-negligible issue in practice, and it has recently received attention in feature
subset selection [11,12]. We illustrate it with the following example. Suppose F1 ⊕ F2 = C , where
F1 and F2 are two boolean variables, C denotes the target concept, and ⊕ represents the xor operation.


Corresponding author: Qinbao Song, Department of Computer Science and Technology, Xi’an Jiaotong University, No.28,
Xianning West Road, Xi’an, Shaanxi 710049, China. Tel.: +86 029 82668645; Fax: +86 029 82668645; E-mail: qbsong@
mail.xjtu.edu.cn.


F1 and F2 each appear irrelevant to C when evaluated individually, but they become highly relevant when
combined. Therefore, removing interactive features will lead to poor predictive
accuracy. The feature subset selection algorithms should eliminate the irrelevant and redundant features
while considering the feature interaction. Unfortunately, only a few of them can deal with not only
irrelevant and redundant features but also interactive ones [11,12].
Association rule mining can discover interesting associations among data items [13]; hence it has
been used to build classifiers with better classification accuracy than other types of classi-
fiers [14–18]. In particular, it has recently been employed for feature subset selection by Xie et al. [19].
However, they only evaluated the relevance between feature(s) and the target concept without considering
redundant and interactive features.
An association rule is an expression A ⇒ C , where A (Antecedent) and C (Consequent) are both
itemsets. If we view A as feature(s) and C as feature(s) or the target concept, association rules
can reveal the dependencies between either feature(s) and feature(s) or feature(s) and the target concept.
Therefore, it is reasonable to devise an association rule mining based method for feature subset selection.
In this article, we propose a Feature subset sElection Algorithm based on aSsociaTion rule mining
(FEAST)1 , which can eliminate the irrelevant and redundant features while taking the feature interaction
into consideration. FEAST uses association as the measure to evaluate the relevance between feature(s)
and the target concept, which is quite different from traditional measures such as the consistency
measure [9,12,20–23], dependence measure [8,10,24], distance measure [5,6,25,26] and information
theory measure [27–30]. The association measure evaluates irrelevant, redundant and interactive features
in a uniform way, and it is at least a potential alternative for feature subset selection.
The proposed feature subset selection algorithm FEAST is tested on five synthetic data sets
and 35 real world data sets. The experimental results show that FEAST can effectively identify the
relevant features while taking the redundant and interactive features into account. Compared with
seven other representative feature subset selection algorithms, FEAST not only reduces the number of features,
but also improves the performance of five well-known classifiers of different types.
The rest of the article is organized as follows: In Section 2, we describe the related work. In Sec-
tion 3, we present the new feature subset selection algorithm FEAST. In Section 4, we introduce the
proposed support and confidence threshold prediction method for FEAST. In Section 5, we provide the
experimental results. Finally, we summarize our work and draw some conclusions in Section 6.

2. Related work

Feature subset selection has been an active research topic since the 1970s, and a great deal of research
work has been published.
Most of the existing feature selection algorithms can effectively identify irrelevant fea-
tures based on different evaluation functions, but not all of them can eliminate redundant features
while taking feature interaction into consideration [31,32]. Thus, the existing feature subset selec-
tion algorithms can be generally grouped into three categories: i) algorithms that only handle irrelevant
features; ii) algorithms that handle both irrelevant and redundant features; and iii) algorithms that handle
irrelevant and redundant features while taking feature interaction into consideration.

1
The corresponding software can be obtained via qbsong@mail.xjtu.edu.cn.

Traditionally, feature subset selection research has focused on searching for relevant features. Feature
weighting/ranking algorithms [33–37] weigh features individually and rank them based on their rele-
vance to the target concept. A well-known example is Relief [6], which weighs each feature according
to its ability to discriminate instances from different targets based on a distance-based criterion function.
However, it is ineffective at removing redundant features as two predictive but highly correlated features
are both likely to be highly weighted [27]. Relief-F [5] is an extension of Relief, enabling the method to
work with noisy and incomplete data sets and to deal with multi-class problems, but it still fails to handle
the redundant features. Liu and Setiono [35] proposed to rank features using the Chi-Square statistic. This
statistic evaluates features individually, so it is also unable to identify redundant features. ElAlami [38]
proposed to select a feature subset from a trained neural network using the Genetic Algorithm (GA). The
GA is used to find the optimal relevant features, which maximize the output function of the trained neural
network. Unfortunately, it is also incapable of removing redundant features.
Besides irrelevant features, redundant features also affect the speed and accuracy of learning algo-
rithms and thus should be eliminated as well [27,39]. CFS [8], FCBF [10], MOD-Tree [40], MIFS [7]
and CMIM [41] are examples that take the redundant features into consideration. CFS [8] is based
on the hypothesis that a good feature subset contains features highly correlated with the target, yet
uncorrelated with each other. FCBF [10] gives the definition of redundant peer between features based
on the symmetric uncertainty measure, and presents a framework of efficient feature selection via rele-
vance and redundancy analysis. MOD-Tree [40] identifies the redundant features based on the oblivious
decision tree. MIFS [7] uses the mutual information as the evaluation function to identify the relevant
features and eliminate the redundant features. CMIM [41] iteratively picks features by maximizing their
mutual information with the class to predict. It does not select any feature similar to the already
picked ones, as such a feature provides no additional information about the class to predict. All these algo-
rithms can effectively remove irrelevant and redundant features, but they do not take feature interaction
into consideration.
Moreover, feature interaction has been drawing more attention in recent years. That is, features that ap-
pear irrelevant individually may become highly relevant when combined with others. There can be two-way,
three-way or complex multi-way interactions among features [42]. Thus, Jakulin and Bratko [11] suggest
using interaction gain as a heuristic to detect feature interaction; their algorithms can detect 2-way (one
feature and the class) and 3-way (two features and the class) interactions. Zhao and Liu [12] propose an
algorithm INTERACT where the feature interactions are implicitly handled by a carefully designed fea-
ture evaluation metric and a search strategy with a specially designed data structure. Chanda et al. [30]
believe statistical interactions can capture the multivariate inter-dependencies among features, and em-
ploy these statistical interactions to improve feature subset selection.
Recently, Xie et al. [19] have utilized association rules for feature subset selection. They calcu-
late the union of the antecedents of the association rules whose consequents are the same
target concept, and use a predefined parameter to control the size of the selected feature subset. Unfortu-
nately, they only detect relevant features and do not handle redundant and interactive features. In contrast,
our algorithm aims to eliminate the irrelevant and redundant features while taking multi-way feature
interactions into consideration, and hence is quite different from the aforementioned algorithms.

3. Feature subset selection algorithm

In this section, we first introduce the concepts of several kinds of constraint association rules, including
strong association rule, classification association rule and atomic association rule. Then, based on these

Table 1
Data set Dxor
F1 F2 F3 F4 Y
0 1 0 1 1
0 0 1 1 0
1 1 0 1 0
1 0 1 1 1
0 0 1 0 0
0 1 0 0 1
1 0 1 1 1
1 1 0 1 0

rules, we give the definitions of irrelevant, redundant and interactive features. Afterward, we present the
proposed feature subset selection algorithm.

3.1. Strong, classification and atomic association rules

Association rule mining searches for interesting relationships among items in a data set D . Let I =
{i1 , i2 , · · · , ik } be a set of items; an association rule is an implication of the form A ⇒ B , where A ⊂ I ,
B ⊂ I , and A ∩ B = φ.
The support, confidence and lift are three important measures of a rule’s interestingness.
1. The support of rule A ⇒ B is the percentage of instances in D that contain both A and B , denoted
as Support (A ⇒ B) = P (A ∪ B); this measure reflects the rule's usefulness, and its value range
is (0, 100%].
2. The confidence of rule A ⇒ B is the percentage value that shows how frequently B occurs among
all the instances containing A. It is denoted as Confidence(A ⇒ B) = P (B|A); this measure
reflects the rule's certainty, and its value range is (0, 100%].
3. The lift of rule A ⇒ B is denoted as Lift (A ⇒ B) = P (B|A)/P (B); this measure reflects the
rule's importance, and its value range is (0, ∞).
Typically, association rules are considered interesting if they satisfy minimum support threshold (min-
Supp), minimum confidence threshold (minConf) and minimum lift threshold (minLift). Usually minLift
is set to 1, minSupp and minConf can be set by users or domain experts. Based on these three thresholds,
strong association rule (SAR) can be defined as follow.
Definition 1. Strong association rule (SAR). A rule r : A ⇒ C is a strong association rule if and only if:
Support (r) > minSupp ∧ Confidence(r) > minConf ∧ Lift(r) > minLift.
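To make these measures concrete, the following Python sketch (our own illustration, not code from the paper) computes the support, confidence and lift of a candidate rule over a small set of transactions and checks the SAR condition of Definition 1; the transactions, the rule and the thresholds are hypothetical.

```python
# Hypothetical transactions: each instance is represented as a set of items.
transactions = [
    {"F1=0", "F2=1", "Y=1"},
    {"F1=0", "F2=0", "Y=0"},
    {"F1=1", "F2=1", "Y=0"},
    {"F1=1", "F2=0", "Y=1"},
]

def support(itemset, data):
    # Fraction of instances that contain every item of `itemset`.
    return sum(itemset <= t for t in data) / len(data)

def confidence(antecedent, consequent, data):
    # P(consequent | antecedent) estimated on the data.
    return support(antecedent | consequent, data) / support(antecedent, data)

def lift(antecedent, consequent, data):
    # Confidence divided by the prior probability of the consequent.
    return confidence(antecedent, consequent, data) / support(consequent, data)

def is_sar(antecedent, consequent, data, min_supp=0.2, min_conf=0.7, min_lift=1.0):
    # Definition 1: all three measures must strictly exceed their thresholds.
    return (support(antecedent | consequent, data) > min_supp
            and confidence(antecedent, consequent, data) > min_conf
            and lift(antecedent, consequent, data) > min_lift)

A, C = {"F1=0", "F2=1"}, {"Y=1"}
print(support(A | C, transactions),    # 0.25
      confidence(A, C, transactions),  # 1.0
      lift(A, C, transactions),        # 2.0
      is_sar(A, C, transactions))      # True
```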
For the sake of introducing classification association rule (CAR) and atomic association rule (AAR),
we first give the concepts of feature value itemset (FVIS) and target value itemset (TVIS).
Let D = {d1 , d2 , · · · , dn } be a data set of n instances, F = {F1 , F2 , · · · , Fm } be the feature space of
D with m features, where Fi is the domain of ith feature and Y be the target concept. The instance di of
D can be denoted as a tuple (Xi , yi ), where Xi ∈ F1 × F2 × · · · × Fm , and yi ∈ Y . Then the feature value
itemset FVIS = F1 ∪ F2 ∪ · · · ∪ Fm contains all possible feature values, and the target value itemset TVIS = Y .
Take the data set Dxor in Table 1 for example, the feature space F = {F1 , F2 , F3 , F4 }, the target
concept Y = F1 ⊕ F2 , F3 = F2 and F4 is a feature whose values are randomly assigned. The domain
of each feature in F is {0, 1}, i.e., F1 = F2 = F3 = F4 = {0, 1}, and Y = {0, 1}. Then the FVIS and
TVIS can be obtained as follows: FVIS = {F1 = 0, F1 = 1, F2 = 0, F2 = 1, F3 = 0, F3 = 1, F4 =
1, F4 = 0}, TVIS = {Y = 0, Y = 1}.

With the definitions of FVIS and TVIS, classification association rule (CAR) and atomic association
rule (AAR) are defined as follows.
Definition 2. Classification association rule (CAR). A rule r : A ⇒ C is a classification association rule
if and only if:
r is a SAR ∧ A ⊆ FVIS ∧ C ⊆ TVIS ∧ |C| = 1.
Here, |Z| denotes the cardinality of set Z .
Definition 3. Atomic association rule (AAR). A rule r : A ⇒ C is an atomic association rule if and only
if:
r is a SAR ∧ A ⊆ FVIS ∧ C ⊆ FVIS ∧ |A| = |C| = 1.
All CARs constitute classification association rule set (CARset). All AARs constitute atomic as-
sociation rule set (AARset). In order to better understand the definitions based on CAR and AAR in
the following section, the CARset and AARset of the data set Dxor (see Table 1) are generated under
minSupp = 20%, minConf = 70% and minLift = 1 as examples (see details in Appendix A).
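To illustrate these two definitions, the sketch below (ours, not the paper's) represents a hand-picked subset of strong rules that can be checked against Table 1 as (antecedent, consequent, confidence) triples, with items written as "Feature=value" strings, and splits them into CARset and AARset; the full rule lists of Appendix A may differ.

```python
# A hypothetical subset of the SARs of Dxor: (antecedent, consequent, confidence).
sars = [
    (frozenset({"F1=0", "F2=1"}), frozenset({"Y=1"}),  1.0),
    (frozenset({"F1=1", "F2=0"}), frozenset({"Y=1"}),  1.0),
    (frozenset({"F1=1", "F2=1"}), frozenset({"Y=0"}),  1.0),
    (frozenset({"F1=0", "F2=0"}), frozenset({"Y=0"}),  1.0),
    (frozenset({"F1=1", "F3=1"}), frozenset({"Y=1"}),  1.0),
    (frozenset({"F2=1"}),         frozenset({"F3=0"}), 1.0),
    (frozenset({"F3=0"}),         frozenset({"F2=1"}), 1.0),
    (frozenset({"F2=0"}),         frozenset({"F3=1"}), 1.0),
    (frozenset({"F3=1"}),         frozenset({"F2=0"}), 1.0),
]

TVIS = {"Y=0", "Y=1"}                                       # target value itemset
FVIS = {f"F{i}={v}" for i in range(1, 5) for v in (0, 1)}   # feature value itemset

# Definition 2: antecedent over feature values, a single target value as consequent.
CARset = [r for r in sars if r[0] <= FVIS and r[1] <= TVIS and len(r[1]) == 1]

# Definition 3: a single feature value implying another single feature value.
AARset = [r for r in sars if r[0] <= FVIS and r[1] <= FVIS and len(r[0]) == len(r[1]) == 1]
```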

3.2. Definitions of relevant, redundant and interactive features

In order to define the relevant, redundant and interactive features based on association rules, we firstly
give the definitions of relevant feature value, redundant feature value and feature value interaction based
on association rules.
Definition 4. Relevant feature value (RelFV). A specific value fij of feature Fi is relevant to the target
concept Y if and only if:
∃r ∈ CARset, fij ∈ r. Antecedent.
Otherwise, fij is said to be an irrelevant feature value (iRelFV).
From Definition 4 we know that the feature values appearing in the antecedent of a rule r ∈
CARset are relevant feature values; conversely, the feature values that never appear in the antecedent of
any rule r ∈ CARset are irrelevant feature values.
As we know, classification association rules have been extensively used in classification [14–18], and
these classification algorithms usually possess relatively better classification accuracy. This indicates
that the rules in CARset can effectively reflect the relationship between features and the target concept.
The feature values appearing in the antecedents of CARs are necessary and related to the target concept.
So it is reasonable to identify the relevant feature values by Definition 4.
For example, in data set Dxor , F1 = 0 is relevant since it appears in the antecedent of CAR: {F1 =
0 ∧ F3 = 0} ⇒ {Y = 0}, and F4 = 0 is irrelevant because there does not exist any CAR ∈ CARset
whose antecedent contains F4 = 0 (see Appendix A). What’s more, F1 is relevant to the target concept
Y of Dxor as Y = F1 ⊕ F2 , and F4 is irrelevant to Y since its values are randomly preassigned. This
indicates that Definition 4 can be used to distinguish irrelevant values from relevant ones.
However, a feature value appearing in a CAR's antecedent may still be redundant, i.e., two closely
correlated feature values may appear simultaneously in the CAR's antecedent. This is because
the association rules are mined based on frequent itemset mining (FIM) [43], and FIM cannot detect
redundant items (i.e., feature values): for a given feature value, if it is frequent and
selected into a frequent itemset, then a feature value redundant to it will be frequent and selected
into the frequent itemset as well. To overcome this problem, the redundant feature value is defined as
follows.

Definition 5. Redundant feature value (RedFV). A specific value f of a feature value set (FVset) is
redundant if and only if:
∃r ∈ AARset, {f } = r. Consequent ∧ r. Antecedent ⊆ FVset.
From Definition 5 we can know that, of a given feature value set, a feature value is redundant when it
appears in the consequent of a rule in AARset and the rule’s antecedent is in the given feature value set
as well.
As we know, for a redundant feature value, the information it carries is already present in another feature
value. This indicates that it is strongly related to and can be replaced by another feature value. What’s
more, AAR can be used to explore the correlation between two feature values. Thus, Definition 5 based
on AAR can be used to detect redundant values.
For example, in the data set Dxor , of a given feature value set {F2 = 1, F3 = 0}, according to
Definition 5, F3 = 0 is redundant as it appears in the consequent of the AAR {F2 = 1} ⇒ {F3 = 0}
(see Appendix A). What’s more, the values of F3 are redundant since F3 = F2 can be viewed as a copy
of F2 . This shows that it is reasonable to identify the redundant feature values by Definition 5.
Note that Definition 5 only detects two-way value redundancy (the redundancy between
two values). Of course, there might exist multi-way feature value redundancy (redundancy among
multiple feature values). However, detecting all multi-way value redundancy is a combinatorial explo-
sion problem, since we would need to list all possible combinations; it is impractical even for a feature space
of medium size. Therefore, we mainly focus on two-way redundancy in this paper.
Suppose FVset = {f1 , f2 , · · · , fk } is a feature value set with k feature values, A is a non-empty proper
subset of FVset, B = FVset − A, and y is a value of the target Y . Let Conf(r) be the confidence of an association
rule r , and rF , rA and rB be the CARs FVset ⇒ {Y = y}, A ⇒ {Y = y} and B ⇒ {Y = y}, respectively.
Then, the interactive feature value can be defined as follow.
Definition 6. kth interactive feature value. The values in FVset are said to interact with each other if
and only if:
Conf (rF ) > Conf (rA ) ∧ Conf (rF ) > Conf (rB ).
The confidence of an association rule reflects the ability of the rule's antecedent to describe its
consequent: a higher confidence means a stronger descriptive ability. In Definition 6, the confidence
of the rule rF is greater than those of the rules rA and rB . This means that although neither feature value
set A nor B alone is sufficiently helpful in describing the target concept, FVset = A ∪ B is more useful. In this
case, feature value sets A and B are said to interact with each other.
According to Definition 4, the CARs usually have high confidence since their confidence must be at
least greater than minConf. This implies that all the rules with high confidence are included in CARset.
In Definition 6, it is impossible that rA or rB is a CAR but rF is not, since Conf(rF ) is greater than both
Conf(rA ) and Conf(rB ). Therefore, the antecedents of the rules in CARset will contain all possible feature
value interactions. That is, the feature value interactions are preserved by the rules in CARset.
For example, in the association rules generated from data set Dxor , Conf({F1 = 0∧F2 = 1} ⇒ {Y =
1}) = 100% is greater than both Conf({F1 = 0} ⇒ {Y = 1}) = 50% and Conf({F2 = 1} ⇒ {Y =
1}) = 50%, so there is a 2th interactive feature value in feature value set {F1 = 0, F2 = 1}. Actually,
there exists feature interaction between F1 and F2 since the target concept Y of Dxor is F1 ⊕ F2 . This
means that Definition 6 works well in identifying the interactive feature values of Dxor .
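The check of Definition 6 therefore reduces to three confidence computations over Table 1, as in the following sketch (the helper function and its representation of antecedents are ours).

```python
# Dxor from Table 1: rows of (F1, F2, F3, F4, Y).
rows = [(0, 1, 0, 1, 1), (0, 0, 1, 1, 0), (1, 1, 0, 1, 0), (1, 0, 1, 1, 1),
        (0, 0, 1, 0, 0), (0, 1, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 1, 0)]

def conf(antecedent, y, data):
    # Confidence of {antecedent} => {Y = y}; antecedent maps feature index -> value.
    covered = [r for r in data if all(r[i] == v for i, v in antecedent.items())]
    return sum(r[-1] == y for r in covered) / len(covered)

c_f = conf({0: 0, 1: 1}, 1, rows)   # Conf({F1=0, F2=1} => {Y=1}) = 1.0
c_a = conf({0: 0}, 1, rows)         # Conf({F1=0} => {Y=1})       = 0.5
c_b = conf({1: 1}, 1, rows)         # Conf({F2=1} => {Y=1})       = 0.5

# Definition 6: the values interact iff the joint rule is strictly more confident.
print(c_f > c_a and c_f > c_b)      # True
```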
According to the definitions of relevant feature value (RelFV), irrelevant feature value (iRelFV),
redundant feature value (RedFV) and interactive feature value, we define relevant feature, redundant
feature and interactive feature in the following.

Definition 7. Relevant feature (RelFea). Feature Fi is relevant to the target concept Y if and only if:
{fij | fij ∈ Fi ∧ fij is a RelFV} ≠ φ.
Otherwise, Fi is an irrelevant feature (iRelFea).
Definition 7 shows that a feature is relevant when at least one of its values is relevant, and all the
values of an irrelevant feature are irrelevant. According to Definition 4, a feature value is relevant only
when it appears in the antecedent of a CAR. Its absence would affect the confidence of the CAR, i.e., the
prediction power of the CAR. On the other hand, under the classical definition of relevant feature [2], a
feature is relevant only when its removal will change the prediction power. So this definition can identify
relevant features.
For example, the relevant feature F1 in the example data set Dxor would be picked up as a relevant
one under this definition since F1 = 0 is a relevant value.
Definition 8. Redundant feature (RedFea). Feature Fi is redundant if and only if:
All values of feature Fi are RedFVs, or at least one is RedFV and the rest are iRelFVs.
Definition 8 indicates that a feature is redundant for one of two reasons: i) each value of this feature is
redundant; or ii) some values of this feature are redundant and the rest are irrelevant. As irrelevant values
provide no information about the target concept and redundant values provide information which is
already present in other values, they are all useless in describing the target concept. This is consistent
with the property of the classical definition of redundant feature in [3,32].
For example, in data set Dxor (see Table 1), according to Definition 8, the feature F3 is redundant because
F3 = 0 and F3 = 1 are both identified as redundant by Definition 5; the feature F4 is also redundant
since F4 = 0 is irrelevant and F4 = 1 is redundant, detected by Definition 7 and Definition 5, respectively.
Definition 9. Interactive feature. Let Fset = {F1 , F2 , · · · , Fk } be a feature subset with k features, and
VAset be its set of value assignments. Features F1 , F2 , · · · , Fk are said to interact with each other if and
only if:
∃ fset ∈ VAset, fset is a FVset with kth interactive feature value.
As we know, there is an intrinsic relationship between a feature and its values. The properties of a
feature subset can be learned from its value assignments. Thus, for a given feature subset, it is reasonable
that the interaction within this feature subset can be implied by, and further studied through, the interaction
among its value assignments. Inspired by this, Definition 9, based on feature value interaction, is proposed to
identify feature interaction.
For example, in data set Dxor (see Table 1), according to Definition 9, features F1 and F2 would
be identified as interactive features as there exist 2th interactive feature values in the feature value set
{F1 = 0, F2 = 1} detected by Definition 6. It is consistent with the target concept Y = F1 ⊕ F2 of
Dxor .

3.3. Feature subset selection algorithm

With the definitions of relevant, redundant and interactive features based on association rules, we
propose a novel feature subset selection algorithm FEAST, which searches for the relevant features
while taking the redundant features and feature interaction into consideration.
Figure 1 shows the framework of FEAST, which consists of four steps: i) Association rule mining,
ii) Relevant feature value set discovery, iii) Redundant feature value elimination and iv) Feature subset
identification.

[Figure 1: flowchart of FEAST. Data set → Association rule mining → Classification Association Rule Set (CARset) and Atomic Association Rule Set (AARset) → Relevant feature value set discovery → Relevant Feature Value Set (RFVset) → Redundant feature value elimination → RFVset without redundancy → Feature subset identification → Feature subset.]

Fig. 1. Framework of FEAST.

1. Association rule mining


Constraint association rules are mined from the given data set according to the predetermined sup-
port threshold minSupp, confidence threshold minConf and lift threshold minLift. These association
rules consist of classification association rules and atomic association rules. Afterward, classifica-
tion association rule set (CARset) and atomic association rule set (AARset) are obtained.
2. Relevant feature value set discovery
By joining the antecedents of the rules in CARset together, an initial relevant feature value set (RFVset),
which preserves the interactive feature values, is obtained according to Definition 4 and Definition 6.
3. Redundant feature value elimination
A feature value is redundant if most of the information it provides is already present in
another feature value; that is, it can be implied by, and thus removed in favor of, another feature value.
For our purpose here, an AAR is used to identify this implication relation between two feature
values. The higher the confidence of the AAR, the stronger the implication. This means that the
AARs with higher confidence should be used to identify and eliminate redundant values first.
Thus, the AARs in AARset are sorted in descending order of confidence in advance.
The AAR r ∈ AARset with the highest confidence is then taken and removed from AARset. According to
Definition 5, the feature value in r's consequent is redundant if r's antecedent feature value is in the
current RFVset; in that case the consequent value is eliminated from RFVset, and AARset is further updated
by removing the other rules whose antecedents are equal to the consequent of r. This procedure is repeated
until AARset is empty.
4. Feature subset identification
After eliminating redundant feature values, the RFVset without irrelevant and redundant feature
values is obtained. Meanwhile, step 2 shows that RFVset contains all interactive feature values

based on which the interactive features are defined (see details in Definition 9). Thus, according
to Definition 7, by mapping the feature values in RFVset to the corresponding features, the final
feature subset is identified, which not only retains relevant features and excludes irrelevant and
redundant features, but also takes feature interaction into consideration.

Algorithm 1 FEAST
Input:
    D - the given data set;
    minSupp - the support threshold;
    minConf - the confidence threshold;
Output:
    S - the selected feature subset;
//Part 1: Association rule mining
1: S ← φ, RFVset ← φ;
2: [CARset, AARset] = FP_growth(D, minSupp, minConf);
//Part 2: Relevant feature value discovery
3: for each r ∈ CARset do
4:     RFVset = RFVset ∪ r.Antecedent;
5: end for
//Part 3: Redundant feature value elimination
6: Sort(AARset);
7: while AARset ≠ φ do
8:     r = the first rule in AARset;
9:     AARset = AARset − {r};
10:    if r.Antecedent ⊂ RFVset then
11:        RFVset = RFVset − r.Consequent;
12:        for each r′ ∈ AARset do
13:            if r′.Antecedent == r.Consequent then
14:                AARset = AARset − {r′};
15:            end if
16:        end for
17:    end if
18: end while
//Part 4: Feature subset identification
19: for each feature value v ∈ RFVset do
20:    if v ∈ value set of feature F then
21:        S = S ∪ {F};
22:    end if
23: end for
24: return S

Algorithm 1 provides the pseudo-code description of FEAST. Of the input parameters, the support thresh-
old minSupp and the confidence threshold minConf are used as the constraint conditions to obtain the strong
association rules (SARs) (see Definition 1).2
The pseudo-code consists of four parts. In part 1 (lines 1–2), CARset and AARset are mined by
function FP_growth() [43] on the given data set D according to minSupp and minConf. In part 2 (lines
3–5), the union of the antecedents of the association rules in CARset constitutes the relevant feature
value set RFVset. Part 3 (lines 6–18) eliminates the redundant feature values in RFVset, where function
Sort() sorts the rules in AARset in descending order of confidence. First, the first rule r is chosen
and removed from AARset. Then, if its antecedent is a subset of the current RFVset, the value in r's
consequent is eliminated from RFVset; meanwhile, the rules whose antecedents are identical to r's
consequent are also removed from AARset. This process repeats until AARset is empty. Part 4 (lines
19–24) obtains the selected feature subset S according to the feature values in RFVset.
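To make parts 2–4 concrete, the following sketch (ours, not the paper's code) implements them in Python on top of already-mined rule sets, with rules represented as (antecedent, consequent, confidence) triples of "Feature=value" strings as in the earlier sketch of Section 3.1; part 1 is delegated to any FP-growth implementation and is not reproduced here.

```python
def feast_select(carset, aarset):
    # Part 2: the union of CAR antecedents forms the relevant feature value set.
    rfvset = set()
    for antecedent, _consequent, _conf in carset:
        rfvset |= antecedent

    # Part 3: eliminate redundant feature values, most confident AARs first.
    aars = sorted(aarset, key=lambda r: r[2], reverse=True)
    while aars:
        antecedent, consequent, _conf = aars.pop(0)
        if antecedent <= rfvset:            # Definition 5
            rfvset -= consequent
            # Drop the remaining AARs whose antecedent equals this rule's consequent.
            aars = [r for r in aars if r[0] != consequent]

    # Part 4: map the surviving feature values back to their features.
    return {value.split("=")[0] for value in rfvset}

# With the hypothetical Dxor rule sets of the earlier sketch,
# feast_select(CARset, AARset) returns {'F1', 'F2'}.
```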
Time complexity of FEAST. In part 1, the CARset and AARset are mined by function FP_growth()
whose time consumption is closely related to the value of minSupp [43]. The time complexity of this
part can be represented as O(f(minSupp, D)), where f(minSupp, D) is a function of minSupp and D, which
increases with the decrease of minSupp or the increase of the size of D. For part 2, once a CAR is generated
by FP-growth, its antecedent can be merged into RFVset at the same time, so the time consumed by this part
can be ignored. For part 3, the main time consumption is the process of sorting the rules in AARset,
so the time complexity of this part is O(V · log V ) (by quick sort), where V is the number of rules in

2
It should be noted that the lift threshold minLift is set to the constant value 1 in our algorithm FEAST. From the
definition of lift, we know that if Lift(A ⇒ B) ≤ 1, i.e., P (B) ≥ P (B|A), the prior probability of B is no less than the
posterior probability given evidence A. This means the antecedent A is not helpful in describing the consequent B, and this kind
of rule is useless.

AARset. The time complexity of part 4 is O(K), where K is the number of feature values in the final
RFVset whose maximum value is the number of all possible feature values in D .
In summary, the time complexity of FEAST is O(f(minSupp, D)) + O(V · log V ) + O(K). Since
part 1 is the major time consumer in the worst case, the efficiency of FEAST largely depends on that of
association rule mining, and it is closely related to the setting of the support threshold minSupp.

4. Support and confidence threshold prediction method

In this section, we first present the general view of the proposed support and confidence threshold
prediction method, then introduce the data metrics used to predict the support and confidence thresholds
for FEAST. Afterward, we provide an approach for identifying the most suitable support and confidence
thresholds. Finally, we give the threshold prediction model construction method.

4.1. General view of the method

For the proposed feature subset selection algorithm FEAST, there are two parameters, the sup-
port and confidence thresholds, which should be determined beforehand. This is because different sup-
port and confidence thresholds result in different feature subsets, and not all of them yield the best
performance for the subsequent learning (such as classification). Moreover, these two parameters also affect
the execution efficiency of FEAST. Therefore, predetermining the most suitable support and
confidence thresholds is a critical problem.
In machine learning, a great deal of research has explored the relationships between classification
algorithm performance and data set characteristics [44–48]. These studies first use some kind of
statistical and information theoretical measures to characterize data sets, then try to capture the rela-
tionships between the measured data set characteristics and the classification algorithm performance with
decision trees, instance-based rule induction methods or regression models. Afterward, these relationships
are used to recommend appropriate classification algorithms for a given new data set.
Building on this research, we believe that there is some kind of relationship between data set metrics
and the most suitable support and confidence thresholds for FEAST, and therefore propose a support and
confidence threshold prediction method.
In the proposed threshold prediction method, the data set metrics of a new data set are first extracted and
then used to predict the most suitable support and confidence thresholds for FEAST according to the
prediction model constructed from a number of existing data sets. Figure 2 shows the framework of this
method.
The proposed method consists of the Prediction model construction with historical data sets and the
Threshold prediction for a new data set. The brief description of these two parts is as follows.
1. Prediction model construction
The prediction model reflects the relationship between data set metrics and the most suitable sup-
port and confidence thresholds for FEAST; it is constructed using a regression method. The pre-
diction model construction involves the following two steps:
1) For each of the given historical data sets, the corresponding metrics and the most suitable sup-
port and confidence thresholds for FEAST are extracted and identified, respectively.
2) Data sets are grouped into clusters according to their metrics. For each cluster, by viewing the
data set metrics as independent variables and the most suitable thresholds for FEAST as depen-
dent variables, the prediction model is built with the partial least square regression (PLSR) [54]
method.

[Figure 2: the framework has two parts. Model construction: for historical data sets, extract data set metrics and identify the suitable thresholds for FEAST, then build the prediction model from these metrics and thresholds. Threshold prediction: for a new data set, extract its metrics and apply the prediction model to obtain the predicted support and confidence thresholds.]

Fig. 2. Threshold prediction framework.

2. Threshold prediction
For a new data set, the corresponding metrics are extracted, and the data set is classified into one
of the clusters according to these metrics. Then the prediction model of that cluster, along with the
metrics of the data set, is used to predict the most suitable support and confidence thresholds for
FEAST.
As the threshold prediction is straightforward, we just focus on the prediction model construction in
the following subsections.

4.2. Data set metrics

Data set metrics are to a data set what features are to an instance: they characterize the data set. Data
set metrics are extracted from a data set and should have the following properties: i) they preserve
both the structural and statistical essentials of the data set, ii) they form a unified representation of different types of
data sets, and iii) they are easy to acquire and related to the feature selection process in FEAST.
As we know, the support and confidence of association rules are calculated according to the frequen-
cies of itemsets. Moreover, Tatti [49] used itemset frequencies to characterize data sets and further
to compute distances between them. However, searching all possible itemsets in a data set is difficult.
Therefore, in our method, the frequencies of two kinds of itemsets, namely 1- and 2-itemsets,3
are extracted and used to summarize a data set.
Meanwhile, the features identified by a feature subset selection algorithm should be related to the
target concept. Thus, measures that are used to describe the target concept should have some kind of
relationships with these features. There are two measures which can be employed to represent the target
concept: frequencies of the target concept values and mutual entropies [50] between the target concept
and features.
To summarize, for a data set, the four sequences extracted to describe it are as follows: i) a frequency
sequence of 1-itemsets; ii) a frequency sequence of 2-itemsets; iii) a frequency sequence of target con-
cept values; iv) a mutual entropy sequence.

3
The 2-itemset is composed of a target concept value and a feature value.

Table 2
Statistical measures for a sequence
Index Measure Notation
1 Arithmetic mean AM
2 Harmonic mean HM
3 Standard deviation Std
4 Minimum value Min
5 25% quantile Qua25
6 Median value Median
7 75% quantile Qua75
8 Maximum Max
9 Interquartile range IQR
10 Kurtosis K
11 Skewness S

Although the sequences introduced above can be used to describe a data set, their lengths
vary from one data set to another. For example, for two data sets
with different numbers of features, the lengths of their mutual entropy sequences are also different, so
different data sets cannot be compared with these sequences directly. In order to address this problem,
the 11 statistical measures listed in Table 2 are used to substitute for each sequence. At the same time, the
entropy of the target concept and the mean value of all features' entropies are used as two metrics as well.
Finally, a total of 46 (4 × 11 + 2) metrics are used to describe data sets.
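As an illustration of how one of the four sequences can be condensed into the 11 measures of Table 2, the following sketch (our own, using NumPy and SciPy) computes them for a single frequency sequence.

```python
import numpy as np
from scipy import stats

def sequence_metrics(seq):
    # The 11 statistical measures of Table 2 for one frequency/entropy sequence.
    x = np.asarray(seq, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    return {
        "AM": x.mean(),
        "HM": stats.hmean(x),        # harmonic mean (assumes strictly positive values)
        "Std": x.std(ddof=1),
        "Min": x.min(),
        "Qua25": q25,
        "Median": np.median(x),
        "Qua75": q75,
        "Max": x.max(),
        "IQR": q75 - q25,
        "K": stats.kurtosis(x),
        "S": stats.skew(x),
    }

# A data set is then described by 4 sequences x 11 measures, plus the entropy of the
# target concept and the mean entropy of the features: 4 * 11 + 2 = 46 metrics.
```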

4.3. Identifying the most suitable support and confidence thresholds

In order to obtain the most suitable support and confidence thresholds for FEAST, brute-force search
is employed. It is a trivial but very general problem-solving technique that consists of systematically
enumerating all possible candidates for the solution and checking whether each candidate satisfies the
problem’s statement.
In the context of identifying the most suitable support and confidence thresholds for FEAST, for
a given data set, different pairs of support and confidence thresholds are first systematically enumerated.
Then, the FEAST algorithm with each of these candidate pairs, together with a number of different types of clas-
sification algorithms, is performed on the given data set with cross-validation. Note that a number of
different types of classification algorithms are adopted; the intention is to ensure the generalization abil-
ity of the selected threshold pair and to avoid the selected threshold pair being suitable only for a certain
specific classification algorithm.
To compare two threshold pairs of support and confidence across all classification algorithms, the
Win/Draw/Loss record [51] and the mean accuracy are employed.
A Win/Draw/Loss record presents three values, corresponding to the numbers of classification algo-
rithms for which support and confidence threshold pair scPair1 obtains better, equal, or worse perfor-
mance than pair scPair2 on classification accuracy.
After obtaining the Win/Draw/Loss records and the mean accuracies across all classification algo-
rithms for all the support and confidence threshold pairs, we are able to identify the most suitable support
and confidence threshold pair(s) from the candidate list CL = { scPair1, scPair2, · · · , scPairn} (n ∈ N )
as follows:
1. Compare the Win/Draw/Loss counts between support and confidence threshold pair scPair1 and
each of scPairi (i = 2, 3, · · · , n). If Count(Win, scPair1:scPairi ) > Count(Loss, scPair1:scPairi )4

4
Count(Win/Loss, scPair1 :scPairi ) denotes the count of support and confidence threshold pair scPair1 won/lost to scPairi .

holds for each pair of {scPair1, scPairi}, then support and confidence threshold pair scPair1 is
selected.
2. Otherwise, the results are conflicting, i.e., candidates scPair1, scPairi, and scPairj do not defeat
one another. That is, the case of Count(Win, scPair1:scPairi) > Count(Loss, scPair1:scPairi )
∧ Count(Win, scPairj :scPair1 ) > Count(Loss, scPairj :scPair1 ) ∧ Count(Win, scPairi :scPairj ) >
Count(Loss, scPairi:scPairj ) appears. In this situation, the corresponding mean performance over
all classification algorithms is consulted. The support and confidence threshold pair with the highest
mean accuracy is selected.
3. Move the selected support and confidence threshold pair from the candidate list CL into the rank-
ing list RL. Repeat steps 1 and 2 until there is only one support and confidence pair in CL. The
ranking list RL = {scPair'1, scPair'2, · · · , scPair'n} (n ∈ N ) can then be used to predict support and
confidence threshold pairs for a new data set. In general, the first pair represents the most suitable
thresholds; if several pairs have the same Win/Draw/Loss counts and
mean accuracies, all of them are selected.
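The following sketch (ours) shows the core of this selection step for one round, assuming accuracies maps each candidate threshold pair to its list of per-classifier accuracies; repeatedly applying it to CL and moving the winner into RL yields the ranking list.

```python
def wdl(acc1, acc2, tol=1e-9):
    # Win/Draw/Loss record of threshold pair 1 against pair 2 across the classifiers.
    win = sum(a > b + tol for a, b in zip(acc1, acc2))
    loss = sum(b > a + tol for a, b in zip(acc1, acc2))
    return win, len(acc1) - win - loss, loss

def most_suitable(accuracies):
    # accuracies: {threshold_pair: [accuracy of classifier 1, ..., classifier k]}
    pairs = list(accuracies)
    # Step 1: prefer a pair that wins more often than it loses against every other pair.
    for p in pairs:
        if all(wdl(accuracies[p], accuracies[q])[0] > wdl(accuracies[p], accuracies[q])[2]
               for q in pairs if q != p):
            return p
    # Step 2: otherwise (conflicting records) fall back to the highest mean accuracy.
    return max(pairs, key=lambda p: sum(accuracies[p]) / len(accuracies[p]))
```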
The time complexity of the brute-force search for appropriate threshold pair(s) consists of two parts.
Firstly, for each pair of thresholds scPairi (1 ≤ i ≤ n), we run all the classifiers with FEAST
and obtain the corresponding mean classification accuracies. The complexity of this part is O(n ·
(O(FEAST) + Σᵏᵢ₌₁ O(classifieri ))), where k is the number of classifiers, and O(FEAST) and O(classifieri )
represent the complexities of FEAST and the ith classifier, respectively. Afterward, we rank all these
threshold pairs according to the sort strategy introduced above and identify the most appropriate pairs.
The complexity of this part is O(n²). Thus, the complexity of identifying the appropriate threshold pairs is
O(n · (O(FEAST) + Σᵏᵢ₌₁ O(classifieri ))) + O(n²), which is related not only to the number of can-
didate pairs, but also to the complexities of FEAST and the training classifiers. Although this procedure
is time-consuming, it is able to find exactly the most appropriate pair(s) of thresholds and guarantees
the reliability of the threshold prediction model constructed later.

4.4. Threshold prediction model construction

To explore the relationship between the data set metrics and the support and confidence thresholds
used in FEAST, a regression analysis method is employed to build the prediction models of support and
confidence thresholds.
The variation of metrics across different types of data sets is significant. In order to improve the model quality,
the data sets are first grouped into clusters according to their metrics, and then prediction models are built
for each of the clusters.
For each cluster, the prediction model is built with instances collected from a number of data sets.
Each instance is composed of two parts: the data set metrics, which can be viewed as the independent
variables, and the support and confidence thresholds, which can be regarded as the dependent variables.
Since there are 46 metrics and two thresholds, this is a regression problem with multiple independent
variables and multiple dependent variables. Moreover, as the support and confidence of association rules are
independent of each other, the two-dependent-variable regression problem is divided into two single-
dependent-variable regression problems. Therefore, the prediction models of support and confidence
thresholds are constructed separately.
However, according to Section 4.3, it is possible that multiple suitable support and confidence thresh-
olds are selected for a single data set. Thus, the dependent variable takes a value set rather than a single
value. Unfortunately, existing regression methods cannot handle the case of multiple dependent val-
ues in an instance [52,53]. In order to address this problem, the dependent variable is divided into three

variables: the minimum value, the mean value and the maximum value of the value set. Then, the predic-
tion models are built with each of the three variables and the metrics separately. The model with the minimum
mean squared error is picked as the final prediction model.
Generally, standard multiple regression requires (n ≥ 30 or n > 3k)5 [52,53] to guarantee sta-
tistical validity. Moreover, the precision and robustness of the standard multiple regression model are
decreased when there exists multi-collinearity6 among independent variables [52]. Unfortunately, there
is likely to be multi-collinearity among the 46 metrics, and the condition n ≥ 30 or n > 3k is not
always satisfied either. Therefore, standard regression methods fail to handle these issues. Partial
least square regression (PLSR) [54] is a recent technique that generalizes and combines features from
principal component analysis and multiple regression. PLSR extracts a set of latent factors (using prin-
cipal component analysis) that maximize the explained covariance between the independent and
dependent variables, then uses these factors to build a multiple regression model that predicts the dependent
variable(s). This method is particularly suited to cases where there are more independent variables than in-
stances and there is multi-collinearity among the independent variables. Thus, PLSR is employed to build
the support and confidence threshold prediction models.
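A minimal sketch of fitting one such model with scikit-learn's PLSRegression is shown below; the random data, the number of latent factors and the variable names are ours, and in the method a separate model is built per cluster and per threshold.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.random((40, 46))   # 46 data set metrics for 40 historical data sets (fake values)
y = rng.random(40)         # the most suitable support (or confidence) threshold of each

pls = PLSRegression(n_components=5)   # number of latent factors (illustrative choice)
pls.fit(X, y)

# Predict the threshold for a new data set from its 46 metrics.
new_metrics = rng.random((1, 46))
predicted_threshold = pls.predict(new_metrics).ravel()[0]
```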

5. Experimental results and analysis

In this section, we first empirically evaluate the performance of FEAST, and present the experimen-
tal results compared with seven other representative feature selection algorithms on both synthetic
and real world data sets.
Afterward, we also provide the support and confidence threshold prediction results of the proposed
threshold prediction method.

5.1. Benchmark data sets

5.1.1. Synthetic data sets


In order to directly evaluate how well FEAST deals with irrelevant and redundant features and feature
interaction, five synthetic data sets whose irrelevant, redundant and interactive features are all known
are employed.
The first two data sets synData1 and synData2 are generated by the data generation tool RDG1 of the
data mining toolkit WEKA.7 The other three data sets about MONK’s problems are available from UCI
Machine Learning Repository [55]. The five data sets are described as follows.
1. synData1. There are 100 instances and 10 boolean features a0 , a1 , · · · , a9 . The target concept c is
defined by c = (a0 ∧ a1 ∧ a5 ) ∨ (a0 ∧ a1 ∧ a6 ∧ a8 ) ∨ (a0 ∧ a1 ∧ a5 ∧ a8 ) ∨ (a0 ∧ a1 ∧ a5 ∧ a8 ) ∨
(a5 ∧ a6 ∧ a8 ) ∨ (a0 ∧ a1 ).
2. synData2. There are 100 instances, 11 boolean features denoted as a0 , a1 , · · · , a9 and a redundant
feature r that is the copy of a5 . The target concept c is defined by c = a5 ∨ (a1 ∧ a6 ∧ a8 ).
3. MONK1. There are 432 instances and 6 features a1 , a2 , · · · , a6 . The target concept c is defined by
c = (a1 = a2 ) ∨ (a5 = 1).

5
Where n is the number of instances and k denotes the number of independent variables.
6
It is a statistical phenomenon in which two or more independent variables in a multiple regression model are highly corre-
lated.
7
http://www.cs.waikato.ac.nz/ml/weka/.

Table 3
Summary of the 35 benchmark data sets
Data ID  Data name  # Target values  # Features  # Instances  |  Data ID  Data name  # Target values  # Features  # Instances
1 Austra 2 14 690 19 Primary-tumor 22 17 339
2 Autos 7 22 205 20 Segment 7 18 2310
3 Cleve 2 12 303 21 Solar-flare1 6 12 323
4 Colic-orig 2 23 368 22 Sonar 2 21 208
5 Colic 2 22 368 23 Spectrometer 48 94 531
6 Credit-a 2 15 690 24 Splice 3 60 3190
7 Credit-g 2 15 1000 25 Vehicle 4 18 846
8 Cylinder 2 20 540 26 Vote 2 16 435
9 Flags 6 26 194 27 Vowel 11 13 990
10 Heart-c 5 11 303 28 Waveform 3 19 5000
11 Hepatitis 2 16 155 29 Wine 3 13 178
12 Ionosphere 2 33 351 30 Zoo 7 16 101
13 Labor 2 14 57 31 CLL-SUB-111 3 11340 111
14 Letter 26 15 20000 32 GLI-85 2 22283 85
15 Lymph 4 18 148 33 SMK-CAN-187 2 19993 187
16 Mfeat-pixel 10 240 2000 34 TOX-171 4 5748 171
17 Molecular 2 57 106 35 warpAR10P 10 2400 130
18 Mushroom 2 22 8124

4. MONK2. There are 432 instances and 6 features a1 , a2 , · · · , a6 . The target concept c is defined by
exactly two of {a1 = 1, a2 = 1, · · · , a6 = 1}.
5. MONK3. There are 432 instances and 6 features a1 , a2 , · · · , a6 . The target concept c is defined by
c = (a5 = 3 ∧ a4 = 1) ∨ (a5 = 4 ∧ a2 = 3). 5% class noise was added to the training set.
For each data set, the features appearing in the definition of the target concept are all relevant, while the
absent features are either redundant or irrelevant. The conjunctive terms in the target concept’s definition
imply feature interactions.

5.1.2. Real world data sets


To study the performance of feature subset selection algorithms, we have employed 35 benchmark
data sets, which are available from the UC Irvine Machine Learning Repository [55] and the feature
selection data sets website at Arizona State University.8 The statistical information of the 35 data sets is sum-
marized in Table 3. From this table we can see that the sizes of the data sets vary between 57 and 20,000
instances, and the number of original features is up to 22,283.
Note that for data sets containing features with continuous values, the MDL discretization method [56]
is applied to discretize the continuous features.

5.2. Experimental results of feature subset selection

In this section, we first introduce the experiment setup, then present the results of feature selection
on synthetic and real world data sets, respectively. Finally, we give the sensitivity analysis of the two
important thresholds minSupp and minConf used in FEAST.

5.2.1. Experiment setup


1. Seven representative feature selection algorithms are chosen to be compared with FEAST.

8
http://featureselection.asu.edu/datasets.php.

Of these algorithms, there are five well-known algorithms CFS (selecting features using corre-
lation based measure and Best-First search) [8], Consistency (selecting features using consistency
of feature subset in target concept values and Best-First search, “Consist” for short hereafter) [9],
FCBF (selecting features using symmetric uncertainty measure) [10], MIFS (selecting features us-
ing mutual information) [7] and FRanker (evaluating features by Chi-square statistic and selecting
features by hypothesis maximization) [35]. They can effectively identify irrelevant features, while
CFS, Consist, FCBF and MIFS take the redundant features into consideration as well.
To further explore the performance of FEAST on handling feature interaction, a well-known
feature selection algorithm INTERACT [12], which is specifically proposed to address feature
interaction, is selected as one benchmark algorithm.
What's more, since the proposed FEAST is an association-rule-based feature selection algorithm, a
recently proposed association-rule-based algorithm, FSBAR [19], is selected as well.
Of these algorithms, both CFS and Consist exploit the best-first strategy to search the space of
feature subsets. For the remaining algorithms, there are parameters to be preassigned by the user. For example,
for FCBF, a relevance threshold needs to be set to identify all features predominant to the target
concept and remove the rest; for MIFS, a mutual information based threshold should be prede-
fined to identify the relevant features; for FRanker, a p-value threshold needs to be set to
detect the relevant features; for INTERACT, a c-contribution threshold is used to
identify the irrelevant features; and for FSBAR, a parameter called cycle number is used to achieve
satisfactory feature selection results. The performance of these algorithms is closely related
to the setting of the corresponding parameters. To make a fair comparison with the newly proposed
algorithm FEAST, the parameters of these algorithms (including FEAST itself)
are determined by the standard cross-validation strategy. CFS, Consist and FCBF are available
in the data mining toolkit WEKA,9 INTERACT and FSBAR are implemented based on WEKA as
well, and MIFS and FRanker are available in the data mining toolkit TANAGRA.10
2. Classification accuracy over the selected feature subset is widely used in the feature selection literature
as a measure to evaluate the performance of a feature selection algorithm. Since the relevant
features of real world data sets are usually not known in advance, we cannot directly evaluate a
feature selection algorithm by the selected features.
However, different classification algorithms have different biases, so a feature subset selection al-
gorithm may be more suitable for some classification algorithms than for others. With this in mind,
five well-known classification algorithms of different types were selected: probability-based Naive
Bayes [57], instance-based IB1 [58], tree-based C4.5 [59], rule-based PART [60] and kernel-based
SVM [61].
In order to make best use of the data and obtain stable results, the classification accuracies of
all classifiers before and after feature selection were obtained via a 5 × 10-fold cross-validation
procedure (a sketch of this protocol is given after this list). That is, for a given data set, each FSS algorithm and each classifier are repeatedly
performed on the data set with 10-fold cross-validation five times.
For each of the 5 × 10 = 50 runs, the FSS algorithm is used to choose features from the training
data, and both the training and test data are reduced according to the selected features. Then, the
classifier is built on the reduced training data, and its classification accuracy is evaluated on the
reduced testing data. Finally, for each classifier, we obtain 50 classification accuracies for each

9
http://www.cs.waikato.ac.nz/ml/weka/.
10
http://eric.univ-lyon2.fr/ricco/∼tanagra/en/tanagra.html.

Table 4
Features selected by the eight algorithms on the five synthetic data sets
FSS algorithm synData1 synData2 MONK1 MONK2 MONK3
CFS a0 , _, a5 , a6 , a8 a0 , a1 , a5 , _, a7 , _, r _, _, a5 _, _, _, _, a5 , _ a2 , _, _
FCBF a0 , _, a5 , a6 , a8 a0 , a1 , a5 , _, a7 , _ _, _, a5 a1 , a2 , a3 , a4 , a5 , a6 a2 , a4 , a5
Consist a0 , a1 , a5 , a6 , a8 a1 , a5 , a6 , a8 a1 , a2 , a5 _, _, _, _, _, _ a2 , a4 , a5
FRanker a0 , _, _, _, a8   a1 , a2 , a5 , r, _, _   a1 , a2 , a3 , a4 , a5 , a6   _, _, _, _, _, _   a1 , a2 , a3 , _, a5 , a6
MIFS a0 , _, a5 , _, a8 , a9 _, a5 , r, a6 , a7 , _ _, _, a5 _, a2 , _, a4 , a5 , a6 a2 , a4 , a5
INTERACT a0 , a1 , a5 , a6 , a8 a1 , a3 , a4 , a5 , a6 , a7 , _ a1 , a2 , a5 a1 , a2 , a3 , a4 , a5 , a6 a2 , a4 , a5
FSBAR a0 , a1 , a3 , a5 , a6 , a8 a0 , a1 , a5 , a6 , a8 _, _, a5 a1 , _, _, _, _, _ a2 , _, a5
FEAST a0 , a1 , a5 , a6 , a8 a1 , a5 , a6 , a8 a1 , a2 , a5 a1 , a2 , a3 , a4 , a5 , a6 a2 , a4 , a5
Relevant features a0 , a1 , a5 , a6 , a8 a1 , a5 , a6 , a8 a1 , a2 , a5 a1 , a2 , a3 , a4 , a5 , a6 a2 , a4 , a5

In Table 4, ‘_’ represents a missing relevant feature, and the letter in bold type means an irrelevant or a redundant feature
selected by mistake. The last row “Relevant features” reports the actual relevant features of each synthetic data set.

feature selection algorithm on each data set. By averaging these accuracies, we obtain an estimate
of the classification accuracy of each classifier under each feature selection algorithm for each data
set (a minimal sketch of this evaluation protocol is given right after this list).
3. To evaluate the performance of the proposed algorithm FEAST, we compare FEAST with seven
other feature subset selection algorithms in three aspects: the number of selected features, the
classification accuracies as well as Win/Draw/Loss counts before and after feature subset selection,
and the runtime of the feature subset selection algorithm.
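To make the evaluation protocol in item 2 concrete, the following is a minimal sketch of the 5 × 10-fold cross-validation loop in Python. It is an illustration only: select_features() is a placeholder for whichever FSS algorithm is evaluated (FEAST itself is not re-implemented here), and scikit-learn's Naive Bayes stands in for one of the five classifiers.

# Minimal sketch of the 5 x 10-fold cross-validation protocol.
# `select_features` is a placeholder for any FSS algorithm (e.g. FEAST).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def select_features(X, y):
    # Placeholder FSS step: return the indices of the selected features.
    return np.arange(X.shape[1])                     # hypothetical: keep all

def evaluate(X, y, times=5, folds=10):
    accuracies = []
    for t in range(times):                           # 5 repetitions
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=t)
        for train_idx, test_idx in skf.split(X, y):  # 10 folds each
            subset = select_features(X[train_idx], y[train_idx])  # train part only
            clf = GaussianNB().fit(X[train_idx][:, subset], y[train_idx])
            pred = clf.predict(X[test_idx][:, subset])
            accuracies.append(accuracy_score(y[test_idx], pred))
    return np.mean(accuracies)                       # average over the 50 runs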

5.2.2. Results and analysis


1. Results on synthetic data sets
Table 4 shows the feature subsets selected by the eight feature selection algorithms on the five synthetic
data sets.
From Table 4, we observe that: i) only the newly proposed algorithm FEAST selects the relevant fea-
tures while eliminating the irrelevant ones for all five synthetic data sets; ii) of these eight algo-
rithms, CFS, FRanker and MIFS cannot detect the redundant feature r of the data set "synData2"; iii)
by comparing the features selected by the different algorithms with the target concept of each synthetic data
set, only FEAST retains all the interactive features for all five data sets. INTERACT works well
on all the data sets except "synData2". The other algorithms detect only part of the interactive
features of some data sets, or identify all interactive features on some but not all of the data sets.
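To illustrate why interactive features defeat purely univariate selectors (the situation the synthetic data sets and the MONK problems are designed to expose), the short sketch below builds an XOR-style data set in the spirit of the Dxor example from the appendix. The data generation here is illustrative only and is not the generator used for synData1 and synData2.

# Illustrative XOR-style data: each relevant feature is useless in isolation,
# so a univariate relevance score cannot detect it (not the paper's generator).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
F1 = rng.integers(0, 2, n)
F2 = rng.integers(0, 2, n)
noise = rng.integers(0, 2, n)            # an irrelevant feature
y = F1 ^ F2                              # target concept: F1 xor F2
X = np.column_stack([F1, F2, noise])

# Each feature alone carries (almost) no information about y ...
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))
# ... yet F1 and F2 together determine y exactly, which is why interactive
# features must be evaluated jointly rather than scored one by one.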

2. Results on real world data sets


In this part, we present the comparison results of FEAST with the other seven feature subset selection
algorithms in terms of i) the number of selected features; ii) the classification accuracies after feature
subset selection; and iii) the runtime.

1). Number of selected features


The reduction in the number of features is an important metric used to evaluate feature subset selection
algorithms. Table 5 presents the number of features selected by each of the eight feature subset selection
algorithms and the number of original features. From this table we observe that i) all feature subset
selection algorithms could significantly reduce the number of features on most data sets. ii) INTERACT
obtains the best reduction rate, while FRanker ranks last with the average number of selected features of
308.8. iii) The average number of selected features of FEAST is 19.66, which is less than that obtained

Table 5
Number of selected features for each of the eight feature selection algorithms (Orig: the original data set, NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR Orig
Austra 13 7 13 7 11 5 13 9 14
Autos 6 5 6 4 18 2 6 12 22
Cleve 9 6 10 6 9 7 10 10 12
Colic-orig 5 2 10 2 13 7 6 7 23
Colic 3 5 7 5 4 4 9 10 22
Credit-a 13 7 13 7 12 5 13 9 15
Credit-g 10 5 13 4 8 2 13 10 15
Cylinder 7 6 11 2 17 1 11 14 20
Flags 7 5 8 4 10 8 7 15 26
Heart-c 9 6 10 6 9 6 10 10 11
Hepatitis 13 9 11 6 9 7 10 11 16
Ionosphere 9 13 7 5 33 4 8 14 33
Labor 9 7 4 6 6 5 5 7 14
Letter 12 11 13 11 15 11 10 Na 15
Lymph 8 10 9 8 13 11 7 10 18
Mfeat-pixel 117 103 10 27 239 224 16 Na 240
Molecular 13 6 4 6 6 7 4 7 57
Mushroom 14 4 5 4 21 5 5 8 22
Primary-tumor 9 12 16 11 12 13 16 9 17
Segment 13 8 9 6 18 11 9 6 18
Solar-flare1 2 3 10 3 5 2 10 9 12
Sonar 17 19 14 10 21 12 12 10 21
Spectrometer 40 20 15 7 94 94 16 16 94
Splice 19 22 10 22 53 28 11 7 60
Vehicle 11 9 17 4 18 4 17 11 18
Vote 2 4 12 4 14 6 9 7 16
Vowel 11 7 9 7 13 9 9 Na 13
Waveform 12 15 12 6 19 10 13 5 19
Wine 9 11 5 10 13 11 5 8 13
Zoo 13 10 5 8 13 14 5 4 16
CLL-SUB-111 17 NA 8 87 2851 NA 17 NA 11340
GLI-85 75 191 4 134 3185 NA 6 NA 22283
SMK-CAN-187 92 101 10 49 1815 NA 16 NA 19993
TOX-171 38 129 10 80 1537 NA 14 NA 5748
warpAR10P 31 46 6 25 674 NA 7 NA 2400
Average 19.66 24.24 10.49 16.94 308.8 17.83 10.14 9.44 1790.74

by CFS and FRanker. Although the average numbers of selected features of MIFS and FSBAR are smaller than that
of FEAST, these two algorithms are not available on the latter five high dimensional data sets due to
their high computational cost.

2). Classification accuracy comparison


For each of the 35 data sets, we obtain the accuracies of each classifier before and after the feature
selection. The results of the five different types of classifiers, namely, Naive Bayes, IB1, C4.5, PART
and SVM are shown in Tables 6, 7, 8, 9 and 10, respectively.
Each table contains the accuracies of the corresponding classifier on the 35 data sets with the different
feature subset selection algorithms, along with the Win/Draw/Loss record, which reports the number
of data sets on which the classification accuracy of the given classifier obtained with FEAST is greater
than, equal to, or smaller than that obtained with the compared feature selection algorithm (a short sketch of this count is given below).
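The Win/Draw/Loss record can be computed from the per-data-set accuracies as in the sketch below; the accuracy lists shown are placeholders rather than values taken from Tables 6-10, and a small tolerance treats accuracies that agree after rounding as a draw.

# Sketch of the Win/Draw/Loss count for FEAST against one competitor.
def win_draw_loss(acc_feast, acc_other, tol=1e-6):
    win = sum(a > b + tol for a, b in zip(acc_feast, acc_other))
    loss = sum(a < b - tol for a, b in zip(acc_feast, acc_other))
    draw = len(acc_feast) - win - loss
    return win, draw, loss

# Placeholder accuracies for three data sets -> (1, 1, 1)
print(win_draw_loss([87.7, 71.2, 84.2], [85.5, 77.4, 84.2]))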
From Table 6 we observe that i) compared to the original data set, the average accuracy of Naive Bayes
is improved by only three out of eight algorithms, i.e., FEAST, CFS and FCBF; ii) FEAST outperforms

Table 6
Accuracy of Naive Bayes with the different feature selection algorithms (Orig: the original data set, NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR Orig
Austra 87.68 85.51 87.25 87.1 85.51 84.49 87.68 86.52 85.22
Autos 71.24 77.4 78.48 69.21 71.19 68.26 78.54 59.51 71.64
Cleve 84.22 84.86 83.5 84.86 84.18 82.86 83.5 82.51 83.85
Colic-orig 83.95 81.52 83.12 84.25 80.97 84.77 77.99 68.21 79.89
Colic 72.02 81.54 76.38 80.73 79.36 79.36 83.15 83.15 70.4
Credit-a 87.25 85.51 87.25 86.38 86.38 84.79 87.25 86.67 86.52
Credit-g 76.1 74.9 74.7 74.6 75 73 75.4 74.8 75.8
Cylinder 71.3 66.67 73.15 56.67 71.3 61.85 73.15 65.93 71.67
Flags 79.89 73.13 77.79 75.18 71.21 79.39 75.26 70.1 73.21
Heart-c 83.46 84.43 83.46 84.43 84.44 85.11 83.5 82.18 84.44
Hepatitis 87.04 87.67 85.83 86.29 87.08 87.08 86.45 84.52 85.13
Ionosphere 93.16 96.31 90.88 90.32 90.6 89.77 92.59 89.74 90.6
Labor 90 89.33 83.67 95 91.67 85.67 85.96 89.47 91.67
Letter 74.48 73.03 74.59 74.52 74.04 74.39 73.42 Na 74.04
Lymph 83.62 81.67 80.9 80.24 84.29 82.38 81.08 83.78 83.67
Mfeat-pixel 92.55 93.3 81.1 91.15 93.3 93.3 80.05 NA 93.3
Molecular 97.18 93.27 93.27 95.27 94.27 91.36 91.51 92.45 90.27
Mushroom 95.59 98.52 98.52 98.52 95.83 96.32 98.94 98.92 95.83
Primary-tumor 47.76 45.71 50.43 46 48.37 48.37 50.44 43.95 50.13
Segment 92.42 93.2 93.2 93.85 91.51 89.52 93.64 93.03 91.51
Solar-flare1 72.44 72.44 65.62 71.8 72.41 52.01 65.63 69.04 65.61
Sonar 84.69 84.19 82.24 83.64 85.62 82.28 79.33 80.77 85.62
Spectrometer 55.56 55.93 53.48 48.21 52.73 52.73 54.05 45.39 52.73
Splice 96.24 92.48 94.33 96.14 95.61 95.74 94.58 91.85 95.36
Vehicle 64.07 62.41 64.07 56.99 62.65 59.11 64.07 64.07 62.65
Vote 95.64 95.64 91.29 96.09 89.92 94.26 93.33 91.26 90.14
Vowel 66.87 56.26 65.86 66.47 67.07 68.89 65.86 Na 67.07
Waveform 81.58 81.08 81 78.04 80.74 79.34 81.6 77.62 80.74
Wine 99.44 97.78 99.44 99.44 98.89 98.89 95.51 99.44 98.89
Zoo 94.09 94.09 93.18 91.18 97 97 93.07 82.18 93.18
CLL-SUB-111 94.59 NA 93.6937 98.20 88.2883 NA 88.29 NA 88.29
GLI-85 100.00 100.00 96.4706 100.00 95.2941 NA 97.65 NA 96.47
SMK-CAN-187 91.44 94.12 80.2139 91.98 71.66 NA 81.28 NA 71.66
TOX-171 95.32 99.42 80.117 99.42 89.47 NA 71.93 NA 89.47
warpAR10P 93.08 95.38 73.0769 90.77 90.00 NA 80.00 NA 90.00
Average 83.88 83.20 81.47 82.94 82.22 80.08 81.31 79.15 81.62
Win/Draw/Loss − 17/4/13 24/4/7 21/2/12 23/1/11 21/0/9 24/3/8 21/2/4 25/0/10

all the other algorithms in terms of average accuracy. It outperforms CFS by 0.82%, Consist by 2.96%,
FCBF by 1.13%, FRanker by 2.02%, MIFS by 4.75%, INTERACT by 3.16% and FSBAR by 5.98%;
iii) FEAST outperforms the other algorithms in terms of Win/Draw/Loss record, as it wins the other
algorithms for 21.57 out of 35 data sets on average.
From Table 7 we observe that i) compared to the original data set, the average accuracy of IB1 is
improved by FEAST, CFS and FRanker; ii) FEAST is the best in terms of the average accuracy im-
provement on IB1. It outperforms CFS by 1.15%, Consist by 2.61%, FCBF by 2.71%, FRanker by
0.94%, MIFS by 6.14%, INTERACT by 4.69% and FSBAR by 6.4%; iii) FEAST outperforms the other
algorithms in terms of Win/Draw/Loss record, as it wins the other algorithms for 19.63 out of 35 data
sets on average.
From Table 8 we observe that i) compared to the original data set, the average accuracy of C4.5 is
improved by FEAST, Consist and FRanker; ii) FEAST outperforms all the other algorithms in terms
of average accuracy. It outperforms CFS by 2.83%, Consist by 2.25%, FCBF by 3.41%, FRanker by

Table 7
Accuracy of IB1 with the different feature selection algorithms (Orig: the original data set, NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR Orig
Austra 83.04 85.51 83.48 84.49 82.75 86.09 81.01 84.49 83.04
Autos 84.29 87.76 90.17 69.74 88.24 65.83 89.27 80 88.24
Cleve 82.87 79.88 80.55 79.88 82.83 80.56 80.2 80.86 78.23
Colic-orig 85.84 81.52 85.32 82.6 83.69 80.16 67.93 77.17 80.98
Colic 79.05 82.3 77.19 79.05 80.41 80.41 79.08 85.33 71.46
Credit-a 81.89 85.51 81.89 83.91 83.48 85.22 79.71 84.35 81.02
Credit-g 72.3 74.4 72.9 74.6 72.9 73 69.7 71.8 72.5
Cylinder 75 65.74 72.78 57.78 72.04 61.48 70.74 72.96 75.37
Flags 69.05 76.26 70 71.08 63.89 68.03 68.04 58.25 54.74
Heart-c 82.15 79.83 81.48 79.83 82.16 79.82 77.89 81.19 79.51
Hepatitis 83.04 84.5 83.29 85.67 85.13 87.08 83.87 82.58 83.25
Ionosphere 93.17 96.87 90.62 88.02 93.74 88.34 93.45 91.45 93.74
Labor 88.33 87.67 84 93 91.33 80.67 85.96 91.23 91.67
Letter 91.81 89.8 92.19 91.23 91.7 91.73 89.92 Na 91.81
Lymph 82.43 83 81.05 80.95 81.62 80.38 79.05 85.14 83.67
Mfeat-pixel 96.15 95.45 80.85 92.8 96.15 96.15 79.2 Na 96.15
Molecular 89.55 87.82 87.64 92.36 92.27 85.91 86.79 82.08 82.27
Mushroom 100 98.52 100 99.02 100 100 100 100 100
Primary-tumor 43.66 40.09 39.5 43.33 39.21 40.39 33.33 43.36 39.21
Segment 94.85 95.85 94.68 95.37 94.68 93.99 95.15 95.19 94.68
Solar-flare1 72.48 72.48 68.15 70.6 71.23 52.64 66.56 69.35 68.15
Sonar 78.4 78.9 83.17 78.81 79.86 78.43 81.73 74.04 79.86
Spectrometer 62.15 61.58 58.95 48.21 59.89 59.89 55.37 44.26 59.89
Splice 81.95 89.31 86.55 81.16 75.36 79.22 84.29 88.34 74.67
Vehicle 70.69 71.4 73.89 56.87 72.83 60.77 71.87 70.57 72.83
Vote 95.64 95.64 93.35 95.18 94.73 96.33 94.25 94.94 92.44
Vowel 90.2 56.87 90 79.7 90.2 84.55 90.1 Na 90.2
Waveform 76 75.48 71.82 73.72 74.68 73.64 72.72 74.1 74.68
Wine 97.78 96.63 98.89 98.89 98.33 97.78 97.19 98.31 98.33
Zoo 97.09 96.09 93.18 93.09 97.09 97.09 94.06 80.2 96.18
CLL-SUB-111 95.50 NA 89.19 99.10 85.59 NA 81.98 NA 84.68
GLI-85 98.82 98.82 96.47 100.00 97.65 NA 92.94 NA 97.65
SMK-CAN-187 88.77 88.77 83.42 81.82 79.14 NA 80.75 NA 79.14
TOX-171 94.74 98.25 81.29 99.42 95.32 NA 78.36 NA 95.32
warpAR10P 95.38 98.46 80.77 94.62 96.15 NA 89.23 NA 96.15
Average 84.40 83.44 82.25 82.17 83.61 79.52 80.62 79.32 82.33
Win/Draw/Loss − 15/4/15 23/2/10 21/1/13 17/4/14 19/7/4 26/1/8 18/1/8 18/5/12

2.18%, MIFS by 5.78%, INTERACT by 2.6% and FSBAR by 3.57%; iii) FEAST outperforms the other
algorithms in terms of Win/Draw/Loss record, as it wins the other algorithms for 21.43 out of 35 data
sets on average.
From Table 9 we observe that i) compared to the original data set, the average accuracy of PART
is improved by all the algorithms except for MIFS and FSBAR; ii) FEAST is the best in terms of the
average accuracy improvement on PART. It outperforms CFS by 2.89%, Consist by 3.15%, FCBF by
3.73%, FRanker by 3.14%, MIFS by 6.88%, INTERACT by 3.77% and FSBAR by 4.35%; iii) FEAST
outperforms the other algorithms in terms of Win/Draw/Loss record, as it wins the other algorithms for
24 out of 35 data sets on average.
From Table 10 we observe that i) compared to the original data set, the average accuracy of SVM is im-
proved by only our algorithm FEAST; ii) FEAST outperforms CFS by 2.12%, Consist by 8.69%, FCBF
by 3.4%, FRanker by 1%, MIFS by 5.87%, INTERACT by 4.18% and FSBAR by 4.62%; iii) FEAST
outperforms the other algorithms in terms of Win/Draw/Loss record, as it wins the other algorithms for
24.71 out of 35 data sets on average.

Table 8
Accuracy of C4.5 with the different feature selection algorithms (Orig: the original data set, NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR Orig
Austra 86.38 85.51 86.52 86.52 87.1 84.2 86.38 87.1 86.52
Autos 76.57 75.55 77.47 67.31 78.02 68.26 77.56 73.17 83.81
Cleve 79.25 79.2 77.93 79.2 78.27 78.9 78.55 78.22 78.6
Colic-orig 85.84 81.52 85.3 81.52 85.03 85.04 66.3 65.22 85.03
Colic 66.31 66.31 66.31 66.31 66.31 66.31 85.6 85.33 66.31
Credit-a 86.96 85.51 86.96 85.94 85.51 84.2 86.96 85.94 87.25
Credit-g 73.9 74.5 73.1 72.4 72.8 73 73 73 72.1
Cylinder 74.82 58.89 63.33 57.78 62.59 57.78 63.33 75.37 62.04
Flags 71.74 72.24 71.66 71.63 74.26 71.74 72.68 69.59 71.18
Heart-c 79.8 79.8 80.79 79.8 79.13 79.81 80.86 80.86 78.79
Hepatitis 84.46 83.13 83.21 84.38 86.37 84.42 83.23 80 82.63
Ionosphere 90.32 100 90.32 88.06 89.17 86.64 91.74 88.89 89.17
Labor 84.33 80.67 78.67 80.67 80.67 67.67 77.19 85.96 80.67
Letter 78.88 79.17 78.85 79.14 78.62 78.98 78.84 Na 78.62
Lymph 77.62 75.72 78.29 70.81 77.71 78.29 74.32 74.32 78.33
Mfeat-pixel 78.75 79.6 71.6 78 78.65 78.65 74.6 Na 78.65
Molecular 80.91 83.82 84.91 82.82 82.82 84.82 83.96 83.96 80.82
Mushroom 100 98.52 100 99.02 100 100 100 100 100
Primary-tumor 43.65 41.56 40.09 42.47 39.81 38.04 39.53 42.18 39.8
Segment 95.59 95.98 95.76 96.02 95.33 95.89 95.41 94.94 95.33
Solar-flare1 72.17 72.17 72.16 72.48 72.15 52.64 71.83 73.07 72.16
Sonar 88 80.31 83.21 76.91 79.81 76.45 81.25 74.04 79.81
Spectrometer 54.24 51.6 56.5 50.09 49.34 49.34 54.43 45.2 49.34
Splice 94.54 92.7 93.82 94.48 94.36 94.54 94.55 92.95 94.36
Vehicle 73.75 72.82 71.51 56.86 71.99 60.65 71.16 71.87 71.99
Vote 95.64 95.64 96.33 94.95 96.33 96.1 96.32 95.4 96.33
Vowel 81.42 56.47 81.42 75.26 80.91 77.58 81.31 Na 80.91
Waveform 76.92 76.16 75.26 75.56 76.48 76.02 76.16 73.32 76.48
Wine 97.78 94.96 96.05 94.41 93.85 93.85 95.51 94.94 93.85
Zoo 92.09 93.18 93.09 92.09 95.18 95.18 92.08 83.17 92.18
CLL-SUB-111 88.29 NA 88.29 87.39 81.08 NA 75.68 NA 81.08
GLI-85 98.82 91.76 89.41 91.76 84.71 NA 90.59 NA 83.53
SMK-CAN-187 85.03 77.54 81.28 80.75 82.89 NA 79.14 NA 82.89
TOX-171 75.44 84.21 76.02 83.04 79.53 NA 73.68 NA 79.53
warpAR10P 86.92 82.31 68.46 86.92 79.23 NA 80.77 NA 79.23
Average 81.63 79.38 79.83 78.94 79.89 77.17 79.56 78.82 79.69
Win/Draw/Loss − 21/4/9 19/6/10 25/4/6 24/2/9 19/4/7 23/3/9 19/1/7 26/2/7

To summarize, for all the five classifiers, FEAST outperforms the other feature subset selection algo-
rithms not only in terms of average accuracy but also in terms of Win/Draw/Loss record. Hence, it is
more effective than the other seven feature subset selection algorithms in terms of classification accuracy
improvement.
We also noticed that, compared with CFS, our algorithm FEAST is only slightly better at improving the
classification accuracy of Naive Bayes. This is because Naive Bayes assumes the features are
independent, so interactive features have a negative impact on it [63].

3). Runtime comparison


Table 11 records the runtime of each feature subset selection algorithm on the 35 data sets. From
it we observe that i) the average runtime of different algorithms varies greatly: FRanker ranks first with
379 ms, while INTERACT ranks last with 105882 ms; ii) FEAST is faster than CFS, Consist, INTERACT
and FSBAR. Although MIFS and FSBAR have relatively shorter runtimes, they are not
available for the high dimensional data sets because of their high computational cost. Meanwhile, compared with the

Table 9
Accuracy of PART with the different feature selection algorithms (Orig: the original data set, NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR Orig
Austra 86.38 85.51 85.07 85.07 86.81 85.07 86.52 86.23 85.8
Autos 79.47 79.45 77.88 67.28 77 67.76 78.05 74.63 78
Cleve 83.19 80.58 80.88 80.58 78.92 76.92 80.53 78.88 79.57
Colic-orig 85.84 81.52 85.58 81.24 85.04 85.85 67.66 71.2 83.96
Colic 67.36 67.39 67.41 66.31 69.01 69.01 83.7 84.24 64.11
Credit-a 84.93 85.51 84.93 85.65 84.79 84.78 85.51 85.07 86.38
Credit-g 74.5 74.3 72.7 75 73.9 73 72.4 71.6 71.9
Cylinder 74.63 62.04 61.3 57.78 66.48 57.78 61.3 73.89 65
Flags 70.16 70.58 65.95 72.1 67.6 68.13 69.59 67.53 64.92
Heart-c 82.13 81.12 81.13 81.12 82.12 80.16 81.19 82.84 78.85
Hepatitis 85.71 83.13 85.08 84.38 81.29 83.13 83.23 78.71 82.58
Ionosphere 90.6 100 91.75 87.49 90.03 87.2 92.02 89.17 90.03
Labor 88 80.67 78.67 85.67 85.67 73 77.19 85.96 80.67
Letter 81.45 81.4 81.04 80.9 80.69 81.15 81.41 Na 80.69
Lymph 81.71 77.14 80.33 78.91 78.95 72.29 78.38 75 79.67
Mfeat-pixel 84.55 82.9 73.35 85.8 82 82 72.65 Na 82
Molecular 85.82 86.73 91.55 84.82 84.82 85.91 89.62 86.79 82.82
Mushroom 100 98.52 99.98 99.02 100 100 100 100 100
Primary-tumor 43.35 45.39 40.41 40.12 41.31 39.49 39.82 43.07 40.7
Segment 94.76 95.15 94.29 94.63 94.42 94.76 95.32 94.03 94.42
Solar-flare1 71.51 71.51 69.38 71.52 71.82 52.64 69.35 73.68 69.68
Sonar 85.62 80.81 83.74 78.83 81.29 77.47 82.69 75 81.29
Spectrometer 52.17 52.73 53.3 49.53 48.59 48.59 53.11 42.75 48.59
Splice 92.76 92.07 93.13 93.39 92.82 93.04 92.85 92.57 92.51
Vehicle 69.73 70.92 71.03 56.63 69.73 59.59 70.21 72.46 69.73
Vote 95.64 95.64 95.63 94.95 95.65 96.1 96.32 95.17 94.71
Vowel 78.08 55.56 77.78 72.83 77.38 76.16 77.58 Na 77.38
Waveform 78.08 78.16 75.32 75.36 78.08 76.02 78.08 75.2 78.08
Wine 96.67 96.11 96.6 95 93.85 93.85 91.57 96.07 93.85
Zoo 95.09 93.18 93.09 92.09 95.18 95.18 92.08 83.17 92.18
CLL-SUB-111 93.69 NA 90.991 86.49 81.0811 NA 82.88 NA 82.88
GLI-85 100.00 91.7647 94.1176 91.76 84.7059 NA 89.41 NA 83.53
SMK-CAN-187 86.10 82.352 80.2139 84.49 80.21 NA 80.21 NA 80.21
TOX-171 81.29 86.5497 78.3626 84.80 81.87 NA 74.27 NA 81.87
warpAR10P 86.92 80 67.6923 82.31 76.92 NA 76.15 NA 76.92
Average 82.51 80.19 79.99 79.54 80.00 77.20 79.51 79.07 79.30
Win/Draw/Loss − 21/2/11 28/1/6 28/0/7 25/3/7 22/2/6 23/2/10 21/1/6 30/3/2

associative-rule-based algorithm FSBAR, FEAST is much more efficient, since it generates association rules
with the FP-growth algorithm, which is more efficient than the Apriori algorithm used in FSBAR (a small illustration of this mining step is given below).
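For illustration, the support- and confidence-constrained mining step that FEAST builds on can be reproduced with the FP-growth implementation of the mlxtend library; this is a sketch of the mining step only (the separation into CARset and AARset and the feature-value mapping of FEAST are not shown), and the tiny transaction list is made up.

# FP-growth based rule mining with minSupp/minConf thresholds, using mlxtend
# (illustration only; not the paper's WEKA-based implementation; the call
# follows mlxtend's classic API and may vary slightly across versions).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Made-up transactions over feature values of the form "feature=value".
transactions = [["F1=1", "F2=0", "Y=1"], ["F1=0", "F2=1", "Y=1"],
                ["F1=1", "F2=1", "Y=0"], ["F1=0", "F2=0", "Y=0"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

itemsets = fpgrowth(df, min_support=0.2, use_colnames=True)   # minSupp = 20%
rules = association_rules(itemsets, metric="confidence",
                          min_threshold=0.7)                  # minConf = 70%

# Keep only rules whose consequent is the target concept (CAR-style rules).
cars = rules[rules["consequents"].apply(
    lambda c: all(item.startswith("Y=") for item in c))]
print(cars[["antecedents", "consequents", "support", "confidence", "lift"]])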

5.2.3. Sensitivity analysis of the support and confidence thresholds


1. Sensitivity analysis on the number of selected features
Figure 3 shows sensitivity analysis results of the support and confidence thresholds on the number of
the selected features for the proposed FEAST algorithm.
From Fig. 3(a) we observe that for all the 35 data sets, the number of selected features decreases as the
support threshold increases. The reason is that a larger support threshold yields fewer frequent itemsets,
and since FEAST chooses the feature subset from itemsets that are at least frequent, the number of
selected features decreases accordingly. We also observe that, although the number of selected features
decreases with an increasing support threshold, the extent of the decrease varies across data sets.
Therefore, we should choose different support thresholds for different data sets.

Table 10
Accuracy of SVM with the different feature selection algorithms (Orig: the original data set, NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR Orig
Austra 85.79 85.39 84.99 85.39 85.04 85.51 84.96 85.16 84.93
Autos 86.34 81.27 82.24 77.46 86.34 67.80 88.59 87.78 85.81
Cleve 84.82 84.55 83.56 84.62 84.49 85.28 83.70 84.75 83.83
Colic-orig 77.72 68.97 80.76 68.97 72.83 72.83 71.36 81.31 78.32
Colic 83.97 82.23 82.72 82.39 83.37 81.85 83.64 82.65 82.71
Credit-a 85.80 85.74 85.30 85.74 85.62 85.51 85.33 85.28 85.42
Credit-g 76.30 71.98 73.84 71.54 75.30 71.70 75.50 75.24 75.80
Cylinder 73.70 66.85 73.56 63.41 72.56 63.22 73.33 71.81 73.15
Flags 75.77 74.33 68.87 71.65 71.13 71.03 72.99 69.35 69.62
Heart-c 84.82 84.88 83.63 84.75 84.36 84.95 83.63 84.48 84.35
Hepatitis 88.39 85.16 86.32 84.90 84.00 81.42 85.68 84.38 87.99
Ionosphere 95.16 94.25 88.72 93.39 91.28 94.13 93.85 92.25 91.28
Labor 87.72 90.88 73.33 89.47 85.96 87.02 82.81 88.80 91.00
Letter 75.20 76.80 76.75 76.80 77.10 77.10 77.20 Na 77.10
Lymph 85.81 82.70 82.30 78.92 83.24 80.27 81.49 84.09 86.18
Mfeat-pixel 97.25 96.68 47.82 93.42 97.53 97.49 83.24 Na 97.56
Molecular 94.34 92.64 53.02 92.64 92.64 91.51 90.57 90.96 92.05
Mushroom 100.00 99.02 99.99 99.02 100.00 99.90 99.90 99.99 100.00
Primary-tumor 45.72 45.78 46.61 45.37 43.13 45.55 46.67 44.60 46.37
Segment 96.15 96.38 96.40 96.51 95.96 95.91 96.39 86.58 95.96
Solar-flare1 73.37 73.50 71.15 73.50 73.44 73.75 71.27 72.81 71.07
Sonar 86.06 85.19 85.00 83.17 86.15 81.15 83.27 84.28 86.10
Spectrometer 62.15 61.54 50.81 53.26 63.50 63.50 60.19 59.70 63.51
Splice 96.24 95.95 51.62 95.96 93.56 95.59 94.61 93.97 92.84
Vehicle 72.34 68.56 72.48 56.57 71.91 56.34 72.41 70.71 71.85
Vote 95.63 95.63 91.13 95.63 95.40 95.63 95.54 95.63 95.72
Vowel 84.95 72.89 82.67 72.85 84.38 77.96 82.63 Na 84.38
Waveform 85.22 85.78 82.98 79.70 85.90 81.14 85.18 78.97 85.99
Wine 98.31 98.20 97.30 98.76 97.98 97.98 95.28 98.66 97.99
Zoo 96.04 96.24 96.04 95.25 96.04 96.04 95.45 87.18 96.04
CLL-SUB-111 96.40 NA 89.19 100.00 97.30 NA 85.59 NA 97.30
GLI-85 98.83 91.76 94.12 100.00 97.65 NA 92.94 NA 97.65
SMK-CAN-187 91.98 94.65 86.63 90.37 89.84 NA 74.33 NA 89.84
TOX-171 98.83 100.00 88.30 98.83 99.42 NA 75.44 NA 99.42
warpAR10P 95.38 99.23 81.54 93.08 98.46 NA 86.92 NA 98.46
Average 86.07 84.28 79.19 83.24 85.22 81.30 82.62 82.27 85.64
Win/Draw/Loss − 21/1/11 29/1/5 26/2/7 23/3/9 22/2/6 30/0/5 22/1/4 20/2/13

From Fig. 3(b) we observe that, as the confidence threshold increases, the number of selected
features may either increase or decrease. The reason is that a given confidence threshold can be paired with
support thresholds of varying values, and the range of these support thresholds differs across confidence
thresholds. Consequently, the corresponding numbers of frequent itemsets, and hence the numbers of
selected features, differ as well. This indicates that both the support and confidence thresholds are
affected by the characteristics of the data set, and different thresholds should be selected for different data sets.

2. Sensitivity analysis on classification accuracy


Figure 4 shows sensitivity analysis results of the support and confidence thresholds on the classifica-
tion accuracies of the five classifiers with the FEAST algorithm.
From Figs 4(a) and 4(b) we observe that i) for a given data set, the classification accuracy varying
trends of the five classifiers w.r.t FEAST are very similar for either the given support thresholds or the

Table 11
Runtime (in ms) for the eight feature selection algorithms (NA: not available.)
Data set FEAST CFS Consist FCBF FRanker MIFS INTERACT FSBAR
Austra 45 42 125 22 1553 85 928 1432
Autos 2216 43 62 22 64 42 69 29206
Cleve 20 22 31 22 12 105 55 412
Colic-orig 31 22 94 42 15 105 74 582
Colic 13 22 78 22 16 62 71 650
Credit-a 1170 22 94 20 22 85 75 2175
Credit-g 85 12 172 11 89 16 102 1032
Cylinder 2750 42 157 20 38 42 85 55637
Flags 319 22 63 42 21 127 86 2649
Heart-c 20 22 32 20 16 84 55 215
Hepatitis 77 42 32 42 38 107 47 9698
Ionosphere 200 63 125 22 45 85 71 322412
Labor 16 22 31 20 14 42 51 18
Letter 1190 275 20843 42 837 338 5517 Na
Lymph 51 22 31 12 6 85 59 6786
Mfeat-pixel 4250 7097 13122 27 346 126500 4409 Na
Molecular 79 22 47 10 8 105 71 631
Mushroom 223 22 937 22 125 105 404 242803
Primary-tumor 624 22 63 22 28 84 63 228
Segment 110 42 453 15 69 105 150 2175
Solar-flare1 20 20 32 20 14 22 56 190
Sonar 894 42 47 12 17 62 52 51264
Spectrometer 4800 212 1755 7 101 2437 345 58551
Splice 1890 126 4500 42 104 1562 842 57435
Vehicle 970 43 141 22 27 42 77 1097
Vote 16 22 79 20 16 42 70 9120
Vowel 2920 22 109 13 18 63 91 Na
Waveform 770 231 2468 20 151 149 714 1439
Wine 30 20 31 14 15 84 48 532
Zoo 1800 22 16 22 11 85 48 17678
CLL-SUB-111 33080 NA 94846 2637 1672 NA 412016 NA
GLI-85 210710 1296543 261426 4281 2077 NA 1247630 NA
SMK-CAN-187 127018 229392 422736 5850 3752 NA 1852980 NA
TOX-171 271746 255268 52701 2431 1303 NA 160042 NA
warpAR10P 17579 9676 6886 796 631 NA 18408 NA
Average 19649 52928 25268 476 379 4429 105882 32446

given confidence thresholds. This reveals that the proposed feature subset selection algorithm FEAST
has no bias towards a particular classifier, i.e., the feature subsets obtained by FEAST are generally applicable. ii) The
classification accuracy varies with both the support and confidence thresholds, and the thresholds cor-
responding to the highest classification accuracy are different for different data sets. For example, in
Fig. 4(a), the support threshold corresponding to the highest classification accuracy is about 10% for
Data set 7, while less than 5% for Data set 8 and greater than 30% for Data set 26. In Fig. 4(b), the
confidence threshold corresponding to the highest classification accuracy is greater than 95% for Data
set 2, while about 70% for Data set 9 and about 90% for Data set 27. This implies that both support and
confidence thresholds affect the feature subset chosen by FEAST, and the best thresholds are different
for different data sets.

3. Sensitivity analysis on runtime


Figure 5 shows the sensitivity analysis results of the support and confidence thresholds on the runtime
of our proposed FEAST algorithm.

[Figure 3: one subplot per data set (Data set 1-35); panel (a) plots the number of selected features against the support threshold (%), panel (b) against the confidence threshold (%).]
Fig. 3. Number of the selected features for the FEAST algorithm vs. different thresholds.

[Figure 4: one subplot per data set; curves for Naive Bayes, IB1, C4.5, PART and SVM; panel (a) plots classification accuracy (%) against the support threshold (%), panel (b) against the confidence threshold (%).]
Fig. 4. Classification accuracies of the five classifiers with FEAST vs. different thresholds. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-130608)

[Figure 5: one subplot per data set; panel (a) plots runtime (ms) against the support threshold (%), panel (b) against the confidence threshold (%).]
Fig. 5. Runtime of the feature subset selection algorithm FEAST vs. different thresholds.

From Fig. 5(a) we observe that for all the data sets, the runtime of FEAST decreases as the support
threshold increases. This is because, with a larger support threshold, the number of frequent itemsets
decreases, so the time used to mine the frequent itemsets decreases as well. At the same time, since
FEAST chooses the feature subset from itemsets that are at least frequent, the time consumed in
feature subset identification also decreases.
From Fig. 5(b) we observe that the runtime of FEAST may increase, decrease or fluctuate as
the confidence threshold increases. The reason is that a given confidence threshold can be paired with
support thresholds of varying values, and the range of these support thresholds differs across confidence
thresholds. This means that the corresponding numbers of frequent itemsets, and hence the numbers of
selected features, differ as well. Thus, the time used to mine frequent itemsets and to identify the
feature subset varies.
To summarize, the performance of the proposed algorithm FEAST is directly affected by the choice
of its two input parameters, the support and confidence thresholds. However, the appropriate thresholds
differ from one data set to another; that is, there are no specific support and confidence thresholds
that are the best choice for all the data sets. Different thresholds should therefore be chosen for different
data sets (a simple grid-search sketch for such per-data-set tuning is given below).
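A per-data-set choice of the two thresholds can be organized as a simple grid search, as sketched below. Both feast() and evaluate_subset() are hypothetical placeholders standing in for the actual algorithm and for a cross-validated accuracy estimate; they are not part of the paper's implementation.

# Sketch of a per-data-set grid search over support/confidence thresholds.
import itertools
import random

def feast(data, supp, conf):
    # Hypothetical stand-in for FEAST: returns a feature subset.
    return data["features"]                       # placeholder: keep everything

def evaluate_subset(data, subset):
    # Hypothetical stand-in for a cross-validated accuracy estimate.
    return random.random()                        # placeholder score

def tune_thresholds(data, supports=(0.05, 0.10, 0.20, 0.30),
                    confidences=(0.70, 0.80, 0.90, 0.95)):
    best = (None, None, -1.0)                     # (supp, conf, accuracy)
    for supp, conf in itertools.product(supports, confidences):
        subset = feast(data, supp, conf)
        acc = evaluate_subset(data, subset)
        if acc > best[2]:
            best = (supp, conf, acc)
    return best

print(tune_thresholds({"features": ["a0", "a1", "a5"]}))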

Procedure ThresholdPrediction&Validation


Input:
TIMES = 5, FOLDS = 10, DATAS = 35;
METRICS = {M1 , M2 , · · · , MDATAS };
CLASSIFIERS = {NaiveBayes, IB1, C4.5, PART, SVM}.
1: for i = 1 to DATAS do
2: DataSetClusters = Clustering(METRICS - {Mi });
3: ClusterID = DataSetClassification(Mi , DataSetClusters);
4: MetricsSet = the metrics of the data sets in cluster ClusterID;
5: [SuppSet, ConfSet] = the appropriate support & confidence thresholds for the data sets in cluster ClusterID;
6: [SuppModel, ConfModel] = PLSR(MetricsSet, SuppSet, ConfSet);
7: [SuppThreshold, ConfThreshold] = apply SuppModel & ConfModel to Mi ;
8: for each classifier ∈ CLASSIFIERS do
9: for k = 1 to TIMES do
10: randomize instance-order for Di and generate FOLDS bins from Di ;
11: for j = 1 to FOLDS do
12: TestData = bin[j], TrainData = Di - TestData;
13: FeatureSubset = apply FEAST to data set TrainData with SuppThreshold & ConfThreshold;
14: [TrainData’, testData’] = reduce TrainData and TestData with FeatureSubset;
15: predictor = apply classifier to TrainData’;
16: Accuracy[k][j] = apply predictor to TestData’;
17: end for
18: end for
19: meanAcc(Di, classifier) = 1/(TIMES · FOLDS) · Σ_{k=1..TIMES} Σ_{j=1..FOLDS} Accuracy[k][j];
20: end for
21: end for

5.3. Experimental results of support and confidence threshold prediction

5.3.1. Experimental process


Firstly, the metrics introduced in Section 4.2 are extracted for each data set Di (i = 1, 2, · · · , 35), resulting in the
metrics set METRICS = {M1 , M2 , · · · , M35 } for all the 35 data sets. At the same time, the most suitable

Table 12
Comparison of the accuracies of the five different classifiers with the most suitable and the predicted support and confidence
thresholds
Data set Naive Bayes IB1 C4.5 PART SVM
acc accPre dec acc accPre dec acc accPre dec acc accPre dec acc accPre dec
(%) (%) (%) (%) (%)
Austra 87.68 87.10 0.66 83.04 83.04 0.00 86.38 86.38 0.00 86.38 86.38 0.00 85.79 85.79 0.00
Autos 71.24 71.22 0.03 84.29 84.19 0.12 76.57 76.56 0.02 79.47 79.41 0.07 86.34 85.27 1.25
Cleve 84.22 82.18 2.42 82.83 80.53 2.78 79.25 79.21 0.05 83.19 77.89 6.37 84.82 82.18 3.11
Colic 83.95 80.43 4.19 85.84 81.52 5.03 85.84 85.33 0.60 85.84 85.33 0.60 83.97 81.52 2.91
Colic-orig 72.02 72.02 0.00 79.05 79.05 0.00 66.31 66.31 0.00 67.36 67.36 0.00 77.72 77.72 0.00
Credit-a 87.25 87.25 0.00 81.89 81.89 0.00 86.96 86.96 0.00 84.93 84.93 0.00 85.80 85.80 0.00
Credit-g 76.10 75.40 0.92 72.30 71.20 1.52 73.90 71.50 3.25 74.50 71.90 3.49 76.30 73.30 3.93
Cylinder 71.30 71.30 0.00 75.00 75.00 0.00 74.82 74.82 0.00 74.63 74.63 0.00 73.70 72.04 2.26
Flags 79.82 79.82 0.00 69.05 69.05 0.00 71.74 71.74 0.00 70.16 70.16 0.00 75.77 70.13 7.45
Heart-c 83.46 82.84 0.74 82.15 80.86 1.57 79.80 79.21 0.74 82.18 80.20 2.41 84.82 82.50 2.74
Hepatitis 87.04 87.04 0.00 83.04 83.04 0.00 84.46 84.46 0.00 85.71 85.71 0.00 88.39 88.39 0.00
Ionosphere 93.16 86.45 7.20 93.17 83.23 10.67 90.32 85.81 5.00 90.60 85.81 5.29 95.16 86.45 9.15
Labor 90.00 90.00 0.00 88.33 88.33 0.00 84.33 84.33 0.00 88.00 88.00 0.00 87.72 87.72 0.00
Letter 74.48 74.48 0.00 91.70 91.70 0.00 78.88 78.88 0.00 81.45 81.45 0.00 75.20 75.20 0.00
Lymph 83.62 83.62 0.00 82.43 82.43 0.00 77.62 77.62 0.00 81.71 81.71 0.00 85.81 84.38 1.67
Mfeat-pixel 92.55 86.75 6.27 96.15 77.70 19.19 78.75 77.80 1.21 84.55 83.20 1.60 97.25 97.25 0.00
Molecular 97.18 94.34 2.92 89.55 81.13 9.40 80.91 79.25 2.06 85.82 85.75 0.08 94.34 89.62 5.00
Mushroom 95.59 94.68 0.95 100 100 0.00 100 100 0.00 100 100 0.00 100 98.19 1.81
Primary-tumor 47.76 47.38 0.80 43.66 38.64 11.49 43.65 40.41 7.42 43.35 41.00 5.41 45.72 43.65 4.53
Segment 92.42 92.42 0.00 94.85 94.85 0.00 95.59 95.59 0.00 94.76 94.76 0.00 96.15 96.15 0.00
Solar-flare1 72.44 72.44 0.00 72.48 72.48 0.00 72.17 72.17 0.00 71.51 71.51 0.00 73.37 73.37 0.01
Sonar 84.62 84.62 0.00 78.40 78.40 0.00 88.00 88.00 0.00 85.62 85.62 0.00 86.06 84.21 2.15
Spectrometer 55.56 55.56 0.00 62.15 62.15 0.00 54.24 54.24 0.00 52.17 52.17 0.00 62.15 62.15 0.00
Splice 96.24 95.30 0.98 81.94 74.01 9.68 94.54 94.36 0.19 92.76 92.38 0.41 96.24 94.76 1.53
Vehicle 64.07 62.65 2.21 70.69 69.81 1.23 73.75 71.99 2.39 69.73 69.72 0.01 72.34 67.26 7.03
Vote 95.64 95.64 0.00 95.64 95.64 0.00 95.64 95.64 0.00 95.64 95.64 0.00 95.63 95.63 0.00
Vowel 66.87 66.87 0.00 90.20 90.20 0.00 81.42 81.42 0.00 78.08 78.08 0.00 84.95 84.95 0.00
Waveform 81.58 79.22 2.89 76.00 74.56 1.89 76.92 76.70 0.29 78.08 77.46 0.79 85.22 84.44 0.92
Wine 99.44 93.82 5.65 97.78 93.82 4.05 97.78 93.26 4.62 96.67 93.82 2.95 98.31 93.82 4.57
Zoo 94.09 94.09 0.00 97.09 97.09 0.00 92.09 92.09 0.00 95.09 95.09 0.00 96.04 95.18 0.90
CLL-SUB-111 94.59 86.66 8.39 95.50 77.74 18.59 88.29 76.84 12.97 93.69 90.36 3.56 96.40 91.17 5.42
GLI-85 100 100 0.00 98.82 98.82 0.00 98.82 98.82 0.00 100 100 0.00 98.83 98.83 0.00
SMK-CAN-187 91.44 91.44 0.00 88.77 88.77 0.00 85.03 85.03 0.00 86.10 86.10 0.00 91.98 91.98 0.00
TOX-171 95.32 95.32 0.00 94.74 94.74 0.00 75.44 75.44 0.00 81.29 81.29 0.00 98.83 98.83 0.00
warpAR10P 93.08 84.62 9.09 95.38 73.08 23.39 86.92 80.77 7.08 86.92 85.38 1.77 95.38 90.77 4.84
Average1 83.88 82.43 1.73 84.40 81.39 3.56 81.63 80.54 1.34 82.51 81.72 0.96 86.07 84.30 2.05
Average2 82.10 81.56 0.66 84.17 83.76 0.49 82.19 81.78 0.50 83.46 82.98 0.57 85.21 83.85 1.59
∗ In Table 12, “acc” denotes the classification accuracy under the most suitable support and confidence thresholds, “accPre” denotes the classification accuracy under the predicted support and confidence thresholds, and “dec(%)” denotes the accuracy decrement percentage when using the predicted thresholds instead of the most suitable thresholds.

support and confidence thresholds are identified for each data set. Then the jackknife validation method is
employed to predict the support and confidence thresholds for each data set and to test the prediction results
with the five different types of classifiers. Moreover, for each data set and each classifier, 5 × 10-fold
cross-validation is used for classification. Procedure ThresholdPrediction&Validation shows the details.
It should be noted that, when clustering data sets according to their metrics, the clustering tool
gCLUTO [62] is employed, and when building the support and confidence threshold recommendation models,
the partial least square regression (PLSR) method [54] is used.
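The prediction step of the procedure above can be sketched with scikit-learn's PLS regression standing in for the original PLSR tool and k-means standing in for gCLUTO; the metric matrices and threshold vectors below are random placeholders, not the paper's measured metrics.

# Sketch of the PLSR-based threshold recommendation (scikit-learn substitutes
# for the paper's tools; all numbers are placeholders).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
metrics = rng.random((34, 6))          # metrics of the other 34 data sets
thresholds = rng.random((34, 2))       # their suitable [support, confidence]
new_metrics = rng.random((1, 6))       # metrics of the held-out data set

# 1) cluster the remaining data sets by their metrics
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(metrics)

# 2) assign the held-out data set to the nearest cluster centroid
centroids = np.array([metrics[labels == c].mean(axis=0) for c in range(4)])
cluster = int(np.argmin(np.linalg.norm(centroids - new_metrics, axis=1)))

# 3) fit a PLS regression from metrics to thresholds within that cluster
mask = labels == cluster
n_comp = max(1, min(2, int(mask.sum()) - 1))      # guard against tiny clusters
pls = PLSRegression(n_components=n_comp).fit(metrics[mask], thresholds[mask])

# 4) predict the support and confidence thresholds for the held-out data set
supp, conf = pls.predict(new_metrics)[0]
print(supp, conf)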

5.3.2. Results and analysis


Table 12 shows the support and confidence threshold prediction and validation results in terms of
classification accuracy for the five different types of classifiers.

From Table 12 we observe that when using the predicted support and confidence thresholds instead of
the most suitable thresholds for each data set, the average classification accuracy (Average1) over the 35
data sets decreases i) from 83.88% to 82.43%, the decrement is 1.73% for Naive Bayes; ii) from 84.40%
to 81.39%, the decrement is 3.56% for IB1; iii) from 81.63% to 80.54%, the decrement is 1.34% for
C4.5; iv) from 82.51% to 81.72%, the decrement is 0.96% for PART and v) from 86.07% to 84.3%, the
decrement is 2.05% for SVM.
We also observe that the number of data sets whose classification accuracy decrement is less than 5%
is 30 out of 35 for Naive Bayes, 27 out of 35 for IB1, 32 out of 35 for C4.5, 32 out of 35 for PART and
30 out of 35 for SVM. By excluding the few data sets whose classification accuracy decrement exceeds
5%, the average classification accuracy (Average2) decrement is 0.66% for Naive Bayes, 0.49% for IB1,
0.5% for C4.5, 0.57% for PART and 1.59% for SVM.
The results show that, for most of the data sets, the classification accuracy of a classifier with the
predicted thresholds remains unchanged or differs only slightly from that with the most
suitable thresholds. This indicates that, in practice, the metrics extracted from data sets can be employed
to predict the support and confidence thresholds for the proposed algorithm FEAST.

6. Conclusion

In this article, we have presented a novel association rule mining based feature subset selection algo-
rithm (viz, FEAST) and the corresponding support and confidence threshold prediction method, with an
aim to get proper features for machine learning and data mining algorithms, so as to further improve
their performance.
We have also compared the proposed algorithm FEAST with the other seven representative feature se-
lection algorithms, including five well-known algorithms CFS, Consistency, FCBF, FRanker and MIFS,
the algorithm INTERACT aiming at solving feature interaction, and an associative-rule-based algo-
rithm FSBAR, on both the five synthetic data sets and the 35 real world data sets. The results on the
synthetic data sets show that FEAST can identify relevant features and remove redundant ones while
preserving feature interaction. The results on the real world data sets show that our proposed algorithm
FEAST outperforms all the other seven feature selection algorithms in terms of the average accuracy
improvement and the Win/Draw/Loss records for all five different types of classifiers: Naive Bayes,
IB1, C4.5, PART and SVM.
What’s more, the corresponding threshold prediction method has been tested extensively using the
35 real world data sets as well. The results showed that the proposed support and confidence threshold
prediction (viz, PLSR-based) method can be used to provide proper support and confidence thresholds
for FEAST.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant 61070006.
The authors would like to thank the editor and the referees for their helpful comments.

References
[1] L.C. Molina, L. Belanche and À. Nebot, Feature Selection Algorithms: A Survey and Experimental Evaluation, in:
Proceedings of the 2002 IEEE International Conference on Data Mining, IEEE, 2002, pp. 306–313.

[2] G.H. John, R. Kohavi and K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the eleventh
international conference on machine learning, Vol. 129, Citeseer, 1994, pp. 121–129.
[3] L. Yu and H. Liu, Redundancy based feature selection for microarray data, in: Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, ACM, 2004, pp. 737–742.
[4] G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning
Research 3 (2003), 1289–1305.
[5] I. Kononenko, Estimating attributes: Analysis and extensions of RELIEF, in: Proceedings of the European conference on
machine learning on Machine Learning, Springer, 1994, pp. 171–182.
[6] K. Kira and L.A. Rendell, The feature selection problem: Traditional methods and a new algorithm, in: Proceedings of
the National Conference on Artificial Intelligence, John Wiley and Sons Ltd, 1992, pp. 129–134.
[7] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on
Neural Networks 5(4) (2002), 537–550.
[8] M.A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, The University of Waikato, 1999.
[9] H. Liu and R. Setiono, A probabilistic approach to feature selection: A Filter Solution, in: Proceedings of the 13th
International Conference on Machine Learning, Citeseer, 1996.
[10] L. Yu and H. Liu, Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution, in: Proceed-
ings of 20th International Conference on Machine Leaning, 2003, pp. 856–863.
[11] A. Jakulin and I. Bratko, Testing the significance of attribute interactions, in: Proceedings of the twenty-first international
conference on Machine learning, ACM, 2004, pp. 409–416.
[12] Z. Zhao and H. Liu, Searching for interacting features in subset selection, Intelligent Data Analysis 13(2) (2009), 207–
228.
[13] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A.I. Verkamo, Finding interesting rules from large sets
of discovered association rules, in: Proceedings of the third international conference on Information and knowledge
management, ACM, 1994, pp. 401–407.
[14] V. Jovanoski and N. Lavrač, Classification rule learning with APRIORI-C, Progress in artificial intelligence (2001),
111–135.
[15] G. Chen, H. Liu, L. Yu, Q. Wei and X. Zhang, A new approach to classification based on association rule mining,
Decision Support Systems 42(2) (2006), 674–689.
[16] W. Li, J. Han and J. Pei, CMAR: accurate and efficient classification based on multiple class-association rules, in:
Proceedings IEEE International Conference on Data Mining, IEEE, 2002, pp. 369–376.
[17] X. Yin and J. Han, CPAR: Classification based on predictive association rules, in: Proceedings of the third SIAM inter-
national conference on data mining, Society for Industrial and Applied, 2003, pp. 331–335.
[18] X. Zhu, Q. Song and Z. Jia, A Weighted Voting-Based Associative Classification Algorithm, Computer Journal 53(6)
(2009), 786–801.
[19] J. Xie, J. Wu and Q. Qian, Feature Selection Algorithm Based on Association Rules Mining Method, in: Proceedings of
the 2009 Eight IEEE/ACIS international Conference on Computer and information Science, IEEE, 2009, pp. 357–362.
[20] A.L. Oliveira and A. Sangiovanni-Vincentelli, Constructive induction using a non-greedy strategy for feature selection,
in: Proceedings of ninth international conference on machine learning, Citeseer, 1992, pp. 355–360.
[21] J.C. Schlimmer, Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal
pruning, in: Proceedings of the Tenth International Conference on Machine Learning, Citeseer, 1993, pp. 284–290.
[22] H. Almuallim and T.G. Dietterich, Learning boolean concepts in the presence of many irrelevant features, Artificial
Intelligence 69(1–2) (1994), 279–305.
[23] M. Dash, H. Liu and H. Motoda, Consistency based feature selection, Knowledge Discovery and Data Mining (2000),
98–109.
[24] C. Sha, X. Qiu and A. Zhou, Feature Selection Based on a New Dependency Measure, in: Fifth International Conference
on Fuzzy Systems and Knowledge Discovery, Vol. 1, IEEE, 2008, pp. 266–270.
[25] B. Raman and T.R. Ioerger, Instance based filter for feature selection, Journal of Machine Learning Research 1 (2002),
1–23.
[26] H. Park and H.C. Kwon, Extended Relief Algorithms in Instance-Based Feature Filtering, in: Proceedings of the Sixth
International Conference on Advanced Language Processing and Web Information Technology, IEEE, 2007, pp. 123–
128.
[27] D. Koller and M. Sahami, Toward optimal feature selection, in: Proceedings of International Conference on Machine
Learning, Citeseer, 1996, pp. 284–292.
[28] D.A. Bell and H. Wang, A formalism for relevance and its application in feature subset selection, Machine learning 41(2)
(2000), 175–195.
[29] M. Last, A. Kandel and O. Maimon, Information-theoretic algorithm for feature selection, Pattern Recognition Letters
22(6–7) (2001), 799–811.

[30] P. Chanda, Y.R. Cho, A. Zhang and M. Ramanathan, Mining of Attribute Interactions Using Information Theoretic
Metrics, in: Proceedings of IEEE international Conference on Data Mining Workshops, IEEE, 2009, pp. 350–355.
[31] M. Dash and H. Liu, Feature selection for classification, Intelligent Data Analysis 1(3) (1997), 131–156.
[32] J. Souza, Feature Selection with a General Hybrid Algorithm, Ph.D. thesis, University of Ottawa (2004).
[33] A.L. Blum and P. Langley, Selection of relevant features and examples in machine learning, Artificial intelligence 97(1-2)
(1997), 245–271.
[34] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3
(2003), 1157–1182.
[35] H. Liu and R. Setiono, Chi2: Feature selection and discretization of numeric attributes, in: Proceedings of the Seventh
International Conference on Tools with Artificial Intelligence, IEEE, 2002, pp. 388–391.
[36] I.B. Jeffery, D.G. Higgins and A.C. Culhane, Comparison and evaluation of methods for generating differentially ex-
pressed gene lists from microarray data, Bmc Bioinformatics 7(1) (2006), 359–374.
[37] I.B. Jeffery, S.F. Madden, P.A. McGettigan, G. Perriere, A.C. Culhane and D.G. Higgins, Integrating transcription factor
binding site information with gene expression datasets, Bioinformatics 23(3) (2007), 298–305.
[38] M.E. ElAlami, A filter model for feature subset selection based on genetic algorithm, Knowledge-Based Systems 22(5)
(2009), 356–362.
[39] R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial intelligence 97(1-2) (1997), 273–324.
[40] P. Langley and S. Sage, Oblivious Decision Trees and Abstract Cases, in: Proceedings of the AAAI-94 Case-Based
Reasoning Workshop, 1994, pp. 113–117.
[41] F. Fleuret, Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5
(2004), 1531–1555.
[42] I.A. Gheyas and L.S. Smith, Feature subset selection in large dimensionality domains, Pattern Recognition 43(1) (2010),
5–13.
[43] J. Han, J. Pei, Y. Yin and R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree ap-
proach, Data Mining and Knowledge Discovery 8(1) (2004), 53–87.
[44] J. Gama and P. Brazdil, Characterization of classification algorithms, Progress in Artificial Intelligence (1995), 189–200.
[45] K.A. Smith, F. Woo, V. Ciesielski and R. Ibrahim, Matching data mining algorithm suitability to data characteristics
using a self-organising map, in: Proceedings of the First International Workshop on Hybrid Intelligent Systems, 2001,
pp. 169–179.
[46] P.B. Brazdil, C. Soares and J.P. Da Costa, Ranking learning algorithms: Using IBL and meta-learning on accuracy and
time results, Machine Learning 50(3) (2003), 251–277.
[47] A. Kalousis, J. Gama and M. Hilario, On data and algorithms: Understanding inductive performance, Machine Learning
54(3) (2004), 275–312.
[48] S. Ali and K.A. Smith, On learning algorithm selection for classification, Applied Soft Computing 6(2) (2006), 119–138.
[49] N. Tatti, Distances between data sets based on summary statistics, The Journal of Machine Learning Research 8 (2007),
131–154.
[50] D.J.C. MacKay, Information theory, inference, and learning algorithms, Cambridge Univ Pr, 2003.
[51] G.I. Webb, Multiboosting: A technique for combining boosting and wagging, Machine Learning 40(2) (2000), 159–196.
[52] R.A. Johnson and D.W. Wichern, Applied multivariate statistical analysis, Madison: Prentice Hall International (1998),
155–162.
[53] J. Cohen, Applied multiple regression/correlation analysis for the behavioral sciences, Lawrence Erlbaum, 2003.
[54] P. Geladi and B.R. Kowalski, Partial least square regression: A tutorial, Analytica Chimica Acta 185 (1986), 1–17.
[55] A. Asuncion and D.J. Newman, UCI machine learning repository, 2007.
[56] U. Fayyad and K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in: Pro-
ceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1993, pp. 1022–1027.
[57] G.H. John and P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the eleventh
conference on uncertainty in artificial intelligence, Vol. 1, Citeseer, 1995, pp. 338–345.
[58] D.W. Aha, D. Kibler and M.K. Albert, Instance-based learning algorithms, Machine Learning 6(1) (1991), 37–66.
[59] J.R. Quinlan, C4.5: programs for machine learning, Morgan Kaufmann, 1993.
[60] E. Frank and I.H. Witten, Generating Accurate Rule Sets Without Global Optimization, in: Proceedings of the Fifteenth
International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., 1998, pp. 144–151.
[61] J.C. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, Advances in Kernel
Methods – Support Vector Learning 208 (1998), 1–21.
[62] M. Rasmussen and G. Karypis, gCLUTO–An Interactive Clustering, Visualization, and Analysis System, 2004.
[63] I. Rish, J. Hellerstein and T. Jayram, An analysis of data characteristics that affect naive Bayes performance, in: Pro-
ceedings of the Eighteenth Conference on Machine Learning, 2001.

Appendix

A. CARset and AARset over the example data set Dxor

The CARset and AARset of Dxor (see Table 1) under minSupp = 20%, minConf = 70% and minLift
= 1 are listed as follows, where Supp and Conf denote the support and confidence of a rule, respectively. (A short sketch of how these quantities are computed is given after the two lists.)
1. Classification association rule set (CARset)
(1) {F1 = 1 ∧ F2 = 0} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(2) {F1 = 1 ∧ F3 = 1} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(3) {F1 = 1 ∧ F2 = 1} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(4) {F1 = 1 ∧ F3 = 0} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(5) {F1 = 0 ∧ F2 = 1} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(6) {F1 = 0 ∧ F3 = 0} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(7) {F1 = 0 ∧ F2 = 0} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(8) {F1 = 0 ∧ F3 = 1} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(9) {F1 = 1 ∧ F2 = 0 ∧ F3 = 1} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(10) {F1 = 1 ∧ F2 = 0 ∧ F4 = 1} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(11) {F1 = 1 ∧ F3 = 1 ∧ F4 = 1} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(12) {F1 = 1 ∧ F2 = 1 ∧ F3 = 0} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(13) {F1 = 1 ∧ F2 = 1 ∧ F4 = 1} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(14) {F1 = 1 ∧ F3 = 0 ∧ F4 = 1} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(15) {F1 = 0 ∧ F2 = 1 ∧ F3 = 0} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(16) {F1 = 0 ∧ F2 = 0 ∧ F3 = 1} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
(17) {F1 = 1 ∧ F2 = 0 ∧ F3 = 1 ∧ F4 = 1} ⇒ {Y = 1}@ (Supp = 25%, Conf = 100%)
(18) {F1 = 1 ∧ F2 = 1 ∧ F3 = 0 ∧ F4 = 1} ⇒ {Y = 0}@ (Supp = 25%, Conf = 100%)
2. Atomic association rule set (AARset)
(1) {F1 = 1} ⇒ {F4 = 1}@ (Supp = 50%, Conf = 100%)
(2) {F2 = 1} ⇒ {F3 = 0}@ (Supp = 50%, Conf = 100%)
(3) {F2 = 0} ⇒ {F3 = 1}@ (Supp = 50%, Conf = 100%)
(4) {F3 = 0} ⇒ {F2 = 1}@ (Supp = 50%, Conf = 100%)
(5) {F3 = 1} ⇒ {F2 = 0}@ (Supp = 50%, Conf = 100%)
(6) {F4 = 0} ⇒ {F1 = 0}@ (Supp = 25%, Conf = 100%)
(7) {F2 = 1} ⇒ {F4 = 1}@ (Supp = 37.5%, Conf = 75%)
(8) {F2 = 0} ⇒ {F4 = 1}@ (Supp = 37.5%, Conf = 75%)
(9) {F3 = 1} ⇒ {F4 = 1}@ (Supp = 37.5%, Conf = 75%)
(10) {F3 = 0} ⇒ {F4 = 1}@ (Supp = 37.5%, Conf = 75%)
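For completeness, the following sketch shows how the Supp, Conf and Lift values attached to these rules are computed from a transaction table. The four transactions used here are a stand-in for illustration only; Table 1 (the actual Dxor data) is not reproduced in this appendix.

# How Supp, Conf and Lift of a rule A => B are computed (illustrative data only).
def supp(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    return supp(antecedent | consequent, transactions) / supp(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    return conf(antecedent, consequent, transactions) / supp(consequent, transactions)

transactions = [frozenset(t) for t in ({"F1=1", "F2=0", "Y=1"},
                                       {"F1=0", "F2=1", "Y=1"},
                                       {"F1=1", "F2=1", "Y=0"},
                                       {"F1=0", "F2=0", "Y=0"})]

A, B = frozenset({"F1=1", "F2=0"}), frozenset({"Y=1"})
print(supp(A | B, transactions),   # 0.25
      conf(A, B, transactions),    # 1.0
      lift(A, B, transactions))    # 2.0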