Editor: Haym Hirsh
Abstract
Variable selection, the process of identifying input variables that are relevant to a particular learning problem, has received much attention in the learning community. Methods that employ a learning algorithm as a part of the selection process (wrappers) have been shown to outperform methods that select variables independently from the learning algorithm (filters), but only at great computational expense. We present a randomized wrapper algorithm whose computational requirements are within a constant factor of simply learning in the presence of all input variables, provided that the number of relevant variables is small and known in advance. We then show how to remove the latter assumption, and demonstrate performance on several problems.
1. Introduction
variables with respect to the learner. However, executing the learning algorithm for each selection of variables during the search ultimately renders the approach intractable in the presence of many irrelevant variables.
In spite of the cost, variable selection can play an important role in learning. Irrelevant variables can often degrade the performance of a learning algorithm, particularly when data are limited. The main computational cost associated with the wrapper method is usually that of executing the learning algorithm. The learner must produce a hypothesis for each subset of the input variables. Even greedy selection methods (Caruana and Freitag, 1994) that ignore large areas of the search space can produce a large number of candidate variable sets in the presence of many irrelevant variables.
Randomized variable elimination avoids the cost of evaluating many variable sets by taking large steps through the space of possible input sets. The number of variables eliminated in a single step depends on the number of currently selected variables. We present a cost function whose purpose is to strike a balance between the probability of failing to select successfully a set of irrelevant variables and the cost of running the learning algorithm many times. We use a form of backward elimination to simplify the detection of relevant variables. Removal of any relevant variable should immediately cause the learner's performance to degrade. Backward elimination also simplifies the selection process when irrelevant variables are much more common than relevant variables, as we assume here.
Analysis of our cost function shows that the cost of removing all irrelevant variables is dominated by the cost of simply learning with all N variables. The total cost is therefore within a constant factor of the cost of simply learning the target function based on all N input variables, provided that the cost of learning grows at least polynomially in N. The bound on the complexity of our algorithm is based on the complexity of the learning algorithm being used. If the given learning algorithm executes in time O(N^2), then removing the N − r irrelevant variables via randomized variable elimination also executes in time O(N^2). This is a substantial improvement compared to the factor N or more increase experienced in removing inputs one at a time.
2. Variable Selection
The specific problem of variable selection is the following: Given a large set of input variables and a target concept or function, produce a subset of the original input variables that best predicts the target concept or function when combined into a hypothesis by a learning algorithm. The term "predict best" may be defined in a variety of ways, depending on the specific application and selection algorithm. Ideally the produced subset should be as small as possible to reduce training costs and help prevent overfitting.
From a theoretical viewpoint, variable selection should not be necessary. For example, the predictive power of Bayes rule increases monotonically with the number of variables. More variables should always result in more discriminating power, and removing variables should only hurt. However, optimal applications of Bayes rule are intractable for all but the smallest problems. Many machine learning algorithms perform sub-optimal operations and do not conform to the strict conditions of Bayes rule, resulting in the potential for a performance decline in the face of unnecessary inputs. More importantly, learning algorithms usually have access to a limited number of examples. Unrelated inputs require additional capacity in the learner, but do not bring new information in exchange. Variable selection is thus a necessary aspect of inductive learning.
A variety of approaches to variable selection have been devised. Most methods can be placed into one of two categories: filter methods or wrapper methods. Filter approaches perform variable selection independently of the learning algorithm, while wrappers make learner-dependent selections. A third group of special purpose methods, known as parameter pruning, performs feature selection in the context of neural networks. These methods cannot directly perform variable selection for arbitrary learning algorithms; they are approaches to removing irrelevant inputs from learning elements.
Many variable selection algorithms (although not all) perform some form of search in the space of variable subsets as part of their operation. A forward selection algorithm begins with the empty set and searches for variables to add. A backward elimination algorithm begins with the set of all variables and searches for variables to remove. Optionally, forward algorithms may occasionally choose to remove variables, and backward algorithms may choose to add variables. This allows the search to recover from previous poor selections. The advantage of forward selection is that, in the presence of many irrelevant variables, the size of the subsets will remain relatively small, helping to speed evaluation. The advantage of backward elimination is that recognizing irrelevant variables is easier. Removing a relevant variable from an otherwise complete set should cause a decline in the evaluation, while adding a relevant variable to an incomplete set may have little immediate impact.
2.1 Filters
Filter methods use statistical measures to evaluate the quality of the variable subsets. The goal is to find a set of variables that is best with respect to the specific quality measure. Determining which variables to include may either be done via an explicit search in the space of variable subsets, or by numerically weighting the variables individually and then selecting those with the largest weight. Filter methods often have the advantage of speed. The statistical measures used to evaluate variables typically require very little computation compared to the cost of running a learning algorithm many times. The disadvantage is that variables are evaluated independently, not in the context of the learning problem.
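The weighting style of filter can be illustrated with a small, self-contained sketch (ours, not from the article): rank each variable by the absolute value of its Pearson correlation with the target and keep the top scorers.

```python
def correlation_filter(X, y, top_k):
    """Rank variables by |Pearson correlation| with the target y and
    return the indices of the top_k highest-scoring columns of X."""
    n = len(y)
    y_mean = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        x_mean = sum(col) / n
        cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(col, y))
        var_x = sum((a - x_mean) ** 2 for a in col)
        var_y = sum((b - y_mean) ** 2 for b in y)
        denom = (var_x * var_y) ** 0.5
        # constant columns get a score of zero rather than a division error
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k]
```

Note that this evaluates each variable in isolation, exactly the weakness described above: two individually weak but jointly informative variables would both score poorly.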
Early filtering algorithms include FOCUS (Almuallim and Dietterich, 1991) and Relief (Kira and Rendell, 1992). FOCUS searches for a smallest set of variables that can completely discriminate between target classes, while Relief ranks variables according to a distance metric. Relief selects training instances at random when computing distance values. Note that this is not related to our approach of selecting variables at random.
Decision trees have also been employed to select input variables by first inducing a tree, and then selecting only those variables tested by decision nodes (Cardie, 1993; Kubat et al., 1993). In another vein, Koller and Sahami (1996) discuss a variable selection algorithm based on cross entropy and information theory.
Methods from statistics also provide a basis for a variety of variable filtering algorithms. Correlation-based feature selection (CFS) (Hall, 1999) attempts to find a set of variables that are each highly correlated with the target function, but not with each other. The ChiMerge (Kerber, 1992) and Chi2 algorithms (Liu and Setiono, 1997) remove both irrelevant and redundant variables using a χ² test to merge adjacent intervals of ordinal variables.
Other methods from statistics solve problems closely related to variable selection. For example, principal component analysis (see Dunteman, 1989) is a method for transforming the observed variables into a smaller number of dimensions, as opposed to removing irrelevant or redundant variables. Projection pursuit (Friedman and Tukey, 1974) and factor analysis (Thurstone, 1931) (see Cooley and Lohnes, 1971, for a detailed presentation) are used both to reduce dimensionality and to detect structure in relationships among variables.
Discussion of filtering methods for variable selection also arises in the pattern recognition literature. For example, Devijver and Kittler (1982) discuss the use of a variety of linear and non-linear distance measures and separability measures such as entropy. They also discuss several search algorithms, such as branch and bound and plus l-take away r. Branch and bound is an optimal search technique that relies on a careful ordering of the search space to avoid an exhaustive search. Plus l-take away r is more akin to the standard forward and backward search. At each step, l new variables are selected for inclusion in the current set and r existing variables are removed.
2.2 Wrappers
Wrapper methods attempt to tailor the selection of variables to the strengths and weaknesses of specific learning algorithms by using the performance of the learner to evaluate subset quality. Each candidate variable set is evaluated by executing the learning algorithm given the selected variables and then testing the accuracy of the resulting hypotheses. This approach has the advantage of using the actual hypothesis accuracy as a measure of subset quality. The problem is that the cost of repeatedly executing the learning algorithm can quickly become prohibitive. Nevertheless, wrapper methods do tend to outperform filter methods. This is not surprising given that wrappers evaluate variables in the context of the learning problem, rather than independently.
2.2.1 Algorithms
John, Kohavi, and Pfleger (1994) appear to have coined the term "wrapper" while researching the method in conjunction with a greedy search algorithm, although the technique has a longer history (Devijver and Kittler, 1982). Caruana and Freitag (1994) also experimented with greedy search methods for variable selection. They found that allowing the search either to add or to remove variables at each step of the search improved over simple forward and backward searches. Aha and Bankert (1994) used a backward elimination beam search in conjunction with the IB1 learner, but found no evidence to prefer this approach to forward selection. OBLIVION (Langley and Sage, 1994) selects variables for the nearest neighbor learning algorithm. The algorithm uses a backward elimination approach with a greedy search, terminating when the nearest neighbor accuracy begins to decline.
Subsequent work by Kohavi and John (1997) used forward and backward best-first search in the space of variable subsets. Search operators generally include adding or removing a single variable from the current set. This approach is capable of producing a minimal set of input variables, but the cost grows exponentially in the face of many irrelevant variables. Compound operators generate nodes deep in the search tree early in the search by combining the best children of a given node. However, the cost of running the best-first search ultimately remains prohibitive in the presence of many irrelevant variables.
Hoeffding races (Maron and Moore, 1994) take a different approach. All possible models (selections) are evaluated via leave-one-out cross validation. For each of the N evaluations, an error confidence interval is established for each model. Models whose error lower bound exceeds the upper bound of the best model are discarded. The result is a set of models whose error is insignificantly different.
Several algorithms for constructing regression models are also forms of wrapper methods. For example, least angle regression (Efron et al., 2003), which generalizes and improves upon several forward selection regression algorithms, adds variables to the model incrementally.
Genetic algorithms have also been applied as a search mechanism for variable selection. Vafaie and De Jong (1995) describe using a genetic algorithm to perform variable selection. They used a straightforward representation in which individual chromosomes were bit-strings with each bit marking the presence or absence of a specific variable. Individuals were evaluated by training and then testing the learning algorithm. In a similar vein, SET-Gen (Cherkauer and Shavlik, 1996) used a fitness (evaluation) function that included both the accuracy of the induced model and the comprehensibility of the model. The learning model used in their experiments was a decision tree, and comprehensibility was defined as a combination of tree size and number of features used. The FSS-EBNA algorithm (Inza et al., 2000) used Bayesian networks to mate individuals in a GA-based approach to variable selection.
The relevance-in-context (RC) algorithm (Domingos, 1997) is based on the idea that some features may only be relevant in particular areas of the instance space for instance-based (lazy) learners. Clusters of training examples are formed by finding examples of the same class with nearly equivalent feature vectors. The features along which the examples differ are removed and the accuracy of the entire model is determined. If the accuracy declines, the features are restored and the failed examples are removed from consideration. The algorithm continues until there are no more examples to consider. Results showed that RC outperformed other wrapper methods with respect to a 1-NN learner.
2.2.2 Learner Selections
Many learning algorithms already contain some (possibly indirect) form of variable selection, such as pruning in decision trees. This raises the question of whether the variable selections made by the learner should be used by the wrapper. Such an approach would almost certainly run faster than methods that rely only on the wrapper to make variable selections. The wrapper selects variables for the learner, and then executes the learner. If the resulting hypothesis is an improvement, then the wrapper further removes all variables not used in the hypothesis before continuing on with the next round of selections.
This approach assumes the learner is capable of making beneficial variable selections. If this were true, then both filter and wrapper methods would be largely irrelevant. Even the most sophisticated learning algorithms may perform poorly in the presence of highly correlated, redundant or irrelevant variables. For example, John, Kohavi, and Pfleger (1994) and Kohavi (1995) both demonstrate how C4.5 (Quinlan, 1993) can be tricked into making bad decisions at the root. Variables highly correlated with the target value, yet ultimately useless in terms of making beneficial data partitions, are selected near the root, leading to unnecessarily large trees. Moreover, these bad decisions cannot be corrected by pruning. Only variable selection performed outside the context of the learning algorithm can recognize these types of correlated, irrelevant variables.
2.2.3 Estimating Performance
One question that any wrapper method must consider is how to obtain a good estimate of the accuracy of the learner's hypothesis. Both the amount and quality of data available to the learner affect the testing accuracy. Kohavi and John (1997) suggest using multiple runs of five-fold cross-validation to obtain an error estimate. They determine the number of cross-validation runs by continuing until the standard deviation of the accuracy estimate is less than 1%. This has the nice property of (usually) requiring fewer runs for large data sets. However, in general, cross-validation is an expensive procedure, requiring the learner to produce several hypotheses for each selection of variables.
these bounds. In other words, the number of training examples required by exponentiated gradient algorithms grows only logarithmically in the number of irrelevant inputs. Exponentiated gradient algorithms may be applied to the problem of separating the set of relevant variables from irrelevant variables by running them on the available data and examining the resulting weights. Although exponentiated gradient algorithms produce a minimum error fit of the data in non-separable problems, there is no guarantee that such a fit will rely on the variables relevant to a non-linear fit.
Many algorithms that are directly applicable in non-linear situations experience a performance decline in the presence of irrelevant input variables. Even support vector machines, which are often touted as impervious to irrelevant variables, have been shown to improve performance with feature selection (Weston et al., 2000). A more general approach to recognizing relevant variables is needed.
3. Setting
Our algorithm for randomized variable elimination (RVE) requires a set (or sequence) of N-dimensional vectors x_i with labels y_i. The learning algorithm L is asked to produce a hypothesis h based only on the inputs x_ij that have not been marked as irrelevant (alternatively, a preprocessor could remove variables marked irrelevant). We assume that the hypotheses bear some relation to the data and input values. A degenerate learner (such as one that produces the same hypothesis regardless of data or input variables) will in practice cause the selection algorithm ultimately to select zero variables. This is true of most wrapper methods. For the purposes of this article, we use generalization accuracy as the performance criterion, but this is not a requirement of the algorithm.
We make the assumption that the number r of relevant variables is at least two, to avoid degenerate cases in our analysis. The number of relevant variables should be small compared to the total number of variables N. This condition is not critical to the functionality of the RVE algorithm; however, the benefit of using RVE increases as the ratio of N to r increases. Importantly, we assume that the number of relevant variables is known in advance, although which variables are relevant remains hidden. Knowledge of r is a very strong assumption in practice, as such information is not typically available. We remove this assumption in Section 6, and present an algorithm for estimating r while removing variables.
4. The Cost Function
Randomized variable elimination is a wrapper method motivated by the idea that, in the presence of many irrelevant variables, the probability of successfully selecting several irrelevant variables simultaneously at random from the set of all variables is high. The algorithm computes the cost of attempting to remove k input variables out of n remaining variables given that r are relevant. A sequence of values for k is then found by minimizing the aggregate cost of removing all N − r irrelevant variables. Note that n represents the number of remaining variables, while N denotes the total number of variables in the original problem.
The first step in applying the RVE algorithm is to define the cost metric for the given learning algorithm. The cost function can be based on a variety of metrics, depending on which learning algorithm is used and the constraints of the application. Ideally, a metric would indicate the amount of computational effort required for the learning algorithm to produce a hypothesis.
For example, an appropriate metric for the perceptron algorithm (Rosenblatt, 1958) might relate to the number of weight updates that must be performed, while the number of calls to the data purity criterion (e.g. information gain (Quinlan, 1986)) may be a good metric for decision tree induction algorithms. Sample complexity represents a metric that can be applied to almost any algorithm, allowing the cost function to compute the number of instances the learner must see in order to remove the irrelevant variables from the problem. We do not assume a specific metric for the definition and analysis of the cost function.
4.1 Definition
The first step of defining the cost function is to consider the probability

    p+(n, r, k) = Π_{i=0}^{k−1} (n − r − i) / (n − i)

of successfully selecting k irrelevant variables at random and without replacement, given that there are n remaining and r relevant variables. Next we use this probability to compute the expected number of consecutive failures before a success at selecting k irrelevant variables from n remaining, given that r are relevant.
    E(n, r, k) = (1 − p+(n, r, k)) / p+(n, r, k)

yields the expected number of consecutive trials in which at least one of the r relevant variables will be randomly selected along with irrelevant variables prior to success.
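Both quantities are straightforward to compute directly from their definitions. A minimal sketch (function names are ours, not from the article):

```python
def p_success(n, r, k):
    """Probability of drawing k irrelevant variables at random, without
    replacement, from n remaining variables of which r are relevant."""
    p = 1.0
    for i in range(k):
        p *= (n - r - i) / (n - i)
    return p

def expected_failures(n, r, k):
    """Expected number of consecutive failed selections before the first
    success: (1 - p+) / p+ for a geometric sequence of trials."""
    p = p_success(n, r, k)
    return (1.0 - p) / p
```

For example, with n = 10, r = 2, and k = 1, the success probability is 8/10, so fewer than one failure is expected on average before a success.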
We now discuss the cost of selecting and removing k variables, given n and r. Let M(L, n) represent an upper bound on the cost of running algorithm L based on n inputs. In the case of a perceptron, M(L, n) could represent an estimated upper bound on the number of updates performed by an n-input perceptron. In some instances, such as a backpropagation neural network (Rumelhart and McClelland, 1986), providing such a bound may be troublesome. In general, the order of the worst case computational cost of the learner with respect to the number of inputs is all that is needed. The bounding function should account for any assumptions about the nature of the learning problem. For example, if learning Boolean functions requires less computational effort than learning real-valued functions, then M(L, n) should include this difference. The general cost function described below therefore need not make any additional assumptions about the data.
In order to simplify the notation somewhat, the following discussion assumes a fixed algorithm L. The expected cost of successfully removing k variables from n remaining, given that r are relevant, is given by

    I(n, r, k) = (E(n, r, k) + 1) · M(L, n − k).
Given this expected cost of removing k variables, we can now define recursively the expected cost of removing all n − r irrelevant variables. The goal is to minimize locally the expected cost of removing k inputs with respect to the expected remaining cost, resulting in a global minimum expected cost for removing all n − r irrelevant variables. The use of a greedy minimization step relies upon the assumption that M(L, n) is monotonic in n. This is reasonable in the context of metrics such as number of updates, number of data purity tests, and sample complexity. The cost (with respect to learning algorithm L) of removing n − r irrelevant variables is represented by

    Isum(n, r) = min_k ( I(n, r, k) + Isum(n − k, r) ).

The first part of the minimization term represents the cost of removing the first k variables, while the second part represents the cost of removing the remaining n − r − k irrelevant variables. Note that we define Isum(r, r) = 0.
The optimal value kopt(n, r) for k given n and r can be determined in a manner similar to computing the cost of removing all n − r irrelevant inputs. The value of k is computed as

    kopt(n, r) = arg min_k ( I(n, r, k) + Isum(n − k, r) ).
4.2 Analysis
The primary benefit of this approach to variable elimination is that the combined cost (in terms of the metric M(L, n)) of learning the target function and removing the irrelevant input variables is within a constant factor of the cost of simply learning the target function based on all N inputs. This result assumes that the function M(L, n) is at least a polynomial of degree j > 0. In cases where M(L, n) is sub-polynomial, running the RVE algorithm increases the cost of removing the irrelevant inputs by a factor of log(n) over the cost of learning alone, as shown below.
4.2.1 Removing Multiple Variables
We now show that the above average-case bounds on the performance of the RVE algorithm hold. The worst case is the unlikely condition in which the algorithm always selects a relevant variable. We assume integer division here for simplicity. First let k = n/r, which allows us to remove the minimization term from the equation for Isum(n, r) and reduces the number of variables. This value of k is not necessarily the value selected by the above equations. However, the cost function is computed via dynamic programming, and the function M(L, n) is assumed monotonic. Any differences between our chosen value of k and the actual value computed by the equations can only serve to decrease further the cost of the algorithm. Note also that, because k depends on the number of current variables n, k changes at each iteration of the algorithm.
The probability of success p+(n, r, n/r) is minimized when n = r + 1, since there is only one possible successful selection and r possible unsuccessful selections. This in turn maximizes the expected number of failures E(n, r, n/r) = r. The formula for I(n, r, k) is now rewritten as

    I(n, r, n/r) ≤ (r + 1) · M(L, n − n/r),

and summing over the successive passes of the algorithm gives

    Isum(n, r) ≤ Σ_{i=0}^{(lg(n) − lg(r)) / lg(1 + 1/(r − 1))} (r + 1) · M(L, n(1 − 1/r)^i).
The second argument to the learning algorithm's cost metric M denotes the number of variables used at step i of the RVE algorithm. Notice that this number decreases geometrically toward r (recall that n = r is the terminating condition for the algorithm). The logarithmic factor of the upper bound on the summation, (lg(n) − lg(r)) / lg(1 + 1/(r − 1)) ≤ r lg(n), follows directly from the geometric decrease in the number of variables used at each step of the algorithm. The linear factor r follows from the relationship between k and r. In general, as r increases, k decreases. Notice that as r approaches N, RVE and our cost function degrade into testing and removing variables individually.
Concluding the analysis, we observe that for functions M(L, n) that are at least polynomial in n with degree j > 0, the cost incurred by the first pass of RVE (i = 0) will dominate the remainder of the terms. The average-case cost of running RVE in these cases is therefore bounded by Isum(N, r) = O(r M(L, N)). An equivalent view is that the sum of a geometrically decreasing series converges to a constant. Thus, under the stated assumption that r is small compared to (and independent of) N, RVE requires only a constant factor more computation than the learner alone.
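The convergence claim is easy to check numerically: for a polynomial cost bound, summing M over the geometrically shrinking variable counts stays within a constant factor of M evaluated at N alone. A quick sanity check (ours, not from the article), using the k = n/r schedule from the analysis:

```python
def geometric_cost_sum(M, N, r):
    """Sum the cost bound M over the geometrically decreasing number of
    remaining variables produced by removing k = n // r per pass."""
    n, total = N, 0.0
    while n > r:
        total += M(n)
        n -= max(1, n // r)  # integer division, as assumed in the analysis
    return total
```

With a quadratic M and r = 10, the ratio of the summed cost to M(N) settles near 1 / (1 − (1 − 1/r)^2) ≈ 5.3, a constant independent of N.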
When M(L, n) is sub-linear in n (e.g. logarithmic), each pass of the algorithm contributes significantly to the total expected cost, resulting in an average-case bound of O(r² log(N) M(L, N)). Note that we use average-case analysis here because in the worst case the algorithm can randomly select relevant variables indefinitely. In practice, however, long streaks of bad selections are rare.
4.2.2 Removing Variables Individually
Consider now the cost of removing the N − r irrelevant variables one at a time (k = 1). Once again the probability of success is minimized and the expected number of failures is maximized at n = r + 1. The total cost of such an approach is given by

    Isum(n, r) = Σ_{i=1}^{n−r} (r + 1) · M(L, n − i).
Unlike the multiple variable removal case, the number of variables available to the learner at each step decreases only arithmetically, resulting in a linear number of steps in n. This is an important deviation from the multiple selection case, which requires only a logarithmic number of steps. The difference between the two methods becomes substantial when N is large.
Concluding, the bound on the average-case cost of RVE is Isum(N, r) = O(N r M(L, N)) when k = 1. This is true regardless of whether the variables are selected randomly or deterministically at each step.
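The arithmetic-versus-geometric difference in step counts is easy to see numerically. The short computation below (ours, not from the article) counts the passes implied by each schedule, ignoring the common (r + 1) failure factor:

```python
def steps_one_at_a_time(N, r):
    """Removing one variable per step always takes N - r steps."""
    return N - r

def steps_geometric(N, r):
    """Removing k = n // r variables per step shrinks n geometrically
    toward r, giving roughly r * ln(N / r) steps."""
    n, steps = N, 0
    while n > r:
        n -= max(1, n // r)
        steps += 1
    return steps
```

For N = 1000 and r = 10, single-variable removal takes 990 passes while the geometric schedule needs only a few dozen.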
In principle, a comparison should be made between the upper bound of the algorithm that removes multiple variables per step and the lower bound of the algorithm that removes a single variable per step in order to show the differences clearly. However, generating
Given: L, N, r
Isum[r + 1..N] ← 0
kopt[r + 1..N] ← 0
for i ← r + 1 to N do
    bestCost ← ∞
    for k ← 1 to i − r do
        temp ← I(i, r, k) + Isum[i − k]
        if temp < bestCost then
            bestCost ← temp
            bestK ← k
    Isum[i] ← bestCost
    kopt[i] ← bestK
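The table-building dynamic program translates directly into Python. The sketch below follows the definitions of Section 4.1, with the expected cost taken as I(n, r, k) = (E(n, r, k) + 1) · M(n − k), as implied by the analysis; the linear M used in the example is only a placeholder, and the code is ours rather than the authors' implementation:

```python
def compute_tables(M, N, r):
    """Build Isum[i] and kopt[i] for r+1 <= i <= N by dynamic programming.
    M(n) is an upper bound on the cost of running the learner on n inputs."""
    def p_success(n, k):
        # probability of selecting k irrelevant variables without replacement
        p = 1.0
        for i in range(k):
            p *= (n - r - i) / (n - i)
        return p

    def I(n, k):
        # expected failures (1-p)/p, plus the final successful run, each
        # costing one execution of the learner on n - k inputs
        p = p_success(n, k)
        return ((1.0 - p) / p + 1.0) * M(n - k)

    Isum, kopt = {r: 0.0}, {}
    for i in range(r + 1, N + 1):
        best_cost, best_k = float("inf"), 1
        for k in range(1, i - r + 1):
            temp = I(i, k) + Isum[i - k]
            if temp < best_cost:
                best_cost, best_k = temp, k
        Isum[i] = best_cost
        kopt[i] = best_k
    return Isum, kopt
```

For instance, with M(n) = n + 1 and r = 2, the smallest table entry Isum[3] evaluates to 3 · M(2) = 9, since only k = 1 is possible and two failures are expected before a success.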
Randomized variable elimination conducts a backward search through the space of variable subsets, eliminating one or more variables per step. Randomization allows for selection of irrelevant variables with high probability, while selecting multiple variables allows the algorithm to move through the space without incurring the cost of evaluating the intervening points in the space. RVE conducts its search along a very narrow trajectory. The space of variable subsets is sampled sparsely, rather than broadly and uniformly. This structured yet random search allows RVE to reduce substantially the total cost of selecting relevant variables.
A backward approach serves two purposes for this algorithm. First, backward elimination eases the problem of recognizing irrelevant or redundant variables. As long as a core set of relevant variables remains intact, removing other variables should not harm the performance of a learning algorithm. Indeed, the learner's performance may increase as irrelevant features are removed from consideration. In contrast, variables whose relevance depends on the presence of other variables may have no noticeable effect when selected in a forward manner. Thus, mistakes should be recognized immediately via backward elimination, while good selections may go unrecognized by a forward selection algorithm.
The second purpose of backward elimination is to ease the process of selecting variables. If most variables in a problem are irrelevant, then a random selection of variables is naturally likely to uncover them. Conversely, a random selection is unlikely to turn up relevant variables in a forward search. Thus, the forward search must work harder to find each relevant variable than backward search does for irrelevant variables.
5.1 Algorithm
The algorithm begins by computing the values of kopt(i, r) for all r + 1 ≤ i ≤ n. Next it generates an initial hypothesis based on all n input variables. Then, at each step, the algorithm selects kopt(n, r) input variables at random for removal. The learning algorithm is trained on the remaining n − k inputs, and a hypothesis h′ is produced. If the error e(h′) of hypothesis h′ is less than the error e(h) of the previous hypothesis h (possibly within a given tolerance), then the selected k inputs are marked as irrelevant and are all simultaneously removed from future consideration. Kohavi and John (1997) provide an in-depth discussion on evaluating and comparing hypotheses based on limited data sets. If the learner was unsuccessful, meaning the new hypothesis had a larger error, then at least one of the selected variables was relevant. A new set of inputs is selected and the process repeats. The algorithm terminates when all n − r irrelevant inputs have been removed. Table 2 shows the RVE algorithm.
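The main loop can be sketched in a few lines, assuming the learner is wrapped as a training function and an error estimator; all interface names here are ours, not the authors':

```python
import random

def rve(train, error, variables, r, kopt, tolerance=0.0):
    """Randomized variable elimination sketch. train(vars) returns a
    hypothesis, error(h) estimates its error, kopt[n] gives the step
    size, and r is the (known) number of relevant variables."""
    current = list(variables)
    h = train(current)
    while len(current) > r:
        k = kopt[len(current)]
        removed = random.sample(current, k)          # random selection
        candidate = [v for v in current if v not in removed]
        h2 = train(candidate)
        if error(h2) - error(h) <= tolerance:
            current, h = candidate, h2               # all k deemed irrelevant
        # otherwise keep the old set and try a new random selection
    return current, h
```

With a deterministic stand-in learner whose error simply counts missing relevant variables, the loop provably never discards a relevant variable and terminates with exactly the relevant set.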
The structured search performed by RVE is easily distinguished from other randomized search methods. For example, genetic algorithms maintain a population of states in the search space and randomly mate the states to produce offspring with properties of both parents. The effect is an initially broad search that targets more specific areas as the search progresses. A wide variety of subsets are explored, but the cost of so much exploration can easily exceed the cost of a traditional greedy search. See Goldberg (1989) or Mitchell (1996) for detailed discussions on how genetic algorithms conduct search.
While GAs tend to drift through the search space based on the properties of individuals in the population, the LVF algorithm (Liu and Setiono, 1996) samples the space of variable subsets uniformly. LVF selects both the size of each subset and the member variables at random. Although such an approach is not susceptible to "bad decisions" or local minima, the probability of finding a best or even good variable subset decreases exponentially as the number of irrelevant variables increases. Unlike RVE, LVF is a filtering method, which relies on the inconsistency rate (number of equivalent instances divided by number of total instances) in the data with respect to the selected variables.
Given: L, n, r, tolerance
compute tables for Isum(i, r) and kopt(i, r)
h ← hypothesis produced by L on n inputs
while n > r do
    k ← kopt(n, r)
    select k variables at random and remove them
    h′ ← hypothesis produced by L on n − k inputs
    if e(h′) − e(h) ≤ tolerance then
        n ← n − k
        h ← h′
    else
        restore the k selected variables
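To make the procedure concrete, the loop above can be sketched in Python. The learner `learn`, error estimate `err`, and schedule `k_opt` below are illustrative placeholders standing in for the learning algorithm, its error measure, and the precomputed table; they are not the paper's implementation.

```python
import random

def rve(train, n_vars, r, k_opt, learn, err, tolerance):
    """Sketch of randomized variable elimination (Table 2).

    k_opt(n, r) returns the number of variables to drop when n
    inputs remain and r are believed relevant.
    """
    selected = set(range(n_vars))          # indices of surviving inputs
    h = learn(train, selected)             # hypothesis on all n inputs
    while len(selected) > r:
        k = k_opt(len(selected), r)
        trial = random.sample(sorted(selected), k)  # pick k inputs at random
        candidate = selected - set(trial)
        h_new = learn(train, candidate)
        if err(h_new) - err(h) <= tolerance:
            selected = candidate           # all k inputs deemed irrelevant
            h = h_new
        # otherwise keep the current set and draw a fresh sample
    return selected, h
```

With a toy learner whose error simply counts missing relevant variables, the loop removes every irrelevant input and stops with exactly the relevant set.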
[Figure 1: cost (y-axis, 0 to 30) versus number of inputs N (x-axis, 50 to 500), with curves labeled RVE, Single, and M(L, N).]

Figure 1: Plot of the expected cost of running RVE (Isum(N, r = 10)) along with the cost of removing inputs individually, and the estimated number of updates M(L, N).
1000 instances in the training data, the cost of running the learning algorithm is fixed at M(L, n) = 1897000(n + 1). Given the above cost formula for an n-input perceptron, a table of values for Isum(n, r) and kopt(n, r) can be constructed.
Figure 1 plots a comparison of the computed cost of the RVE algorithm, the cost of removing variables individually, and the estimated number of updates M(L, N) of an N-input perceptron. The calculated cost of the RVE algorithm maintains a linear growth rate with respect to N, while the cost of removing variables individually grows as N². This agrees with our analysis of the RVE and individual removal approaches. Relationships similar to that shown in Figure 1 arise for other values of r, although the constant factor that separates Isum(n, r) and M(L, n) increases with r.
After creating the table kopt(n, r), the selection and removal process begins. Since the seven-of-ten learning problem is linearly separable, the tolerance for comparing the new and current hypotheses was set to near zero. A small tolerance of 0.06 (equivalent to about 15 misclassifications) is necessary since the thermal perceptron does not guarantee a minimum error hypothesis.

We also allow the current hypothesis to bias the next by not randomizing the weights (of remaining variables) after each pass of RVE. Small-valued weights, suggesting potentially irrelevant variables, can easily transfer from one hypothesis to the next, although this is not guaranteed. Seeding the perceptron weights may increase the chance of finding a linear separator if one exists. If no separator exists, then seeding the weights should have minimal impact. In practice we found that the effect of seeding the weights was nullified by the pocket perceptron's use of annealing.
[Figure 2: number of inputs (y-axis, 0 to 100) versus total updates ×10⁹ (x-axis, up to 15), with curves labeled RVE and Individual.]

Figure 2: A comparison between the number of inputs on which the perceptrons are trained and the mean aggregate number of updates performed by the perceptrons.
6. Choosing k When r Is Unknown
The assumption that the number of relevant variables r is known has played a critical role in the preceding discussion. In practice, this is a strong assumption that is not easily met. We would like an algorithm that removes irrelevant attributes efficiently without such knowledge. One approach would be simply to guess values for r and see how RVE fares. This is unsatisfying, however, as a poor guess can destroy the efficiency of RVE. In general, guessing specific values for r is difficult, but placing a loose bound around r may be much easier. In some cases, the maximum value for r may be known to be much less than N, while in other cases, r can always be bounded by 1 and N.
Given some bound on the maximum rmax and minimum rmin values for r, a binary search for r can be conducted during RVE's search for relevant variables. This relies on the idea that RVE attempts to balance the cost of learning against the cost of selecting relevant variables for removal. At each step of RVE, a certain number of failures, E(n, r, k), are expected. Thus, if selecting variables for removal is too easy (i.e., we are selecting too few variables at each step), then the estimate for r is too high. Similarly, if selection fails an inordinate number of times, then the estimate for r is too low.
The choice of when to adjust r is important. The selection process must be allowed to fail a certain number of times for each success, but allowing too many failures will decrease the efficiency of the algorithm. We bound the number of failures by c1 E(n, r, k), where c1 > 1 is a constant. This allows for the failures prescribed by the cost function along with some amount of "bad luck" in the random variable selections. The number of consecutive successes is bounded similarly by c2 (r − E(n, r, k)), where c2 > 0 is a constant. Since E(n, r, k) is at most r, the value of this expression decreases as the expected number of failures increases. In practice c1 = 3 and c2 = 0.3 appear to work well.
Given: L, n, rmax, rmin, tolerance
r ← (rmax + rmin)/2
success, fail ← 0
h ← hypothesis produced by L on n inputs
repeat
    k ← kopt(n, r)
    select k variables at random and remove them
    h′ ← hypothesis produced by L on n − k inputs
    if e(h′) − e(h) ≤ tolerance then
        n ← n − k
        h ← h′
        success ← success + 1
        fail ← 0
    else
        restore the k selected variables
        fail ← fail + 1
        success ← 0
    if fail ≥ c1 E(n, r, k) then
        rmin ← r
        r ← (rmax + rmin)/2
        success, fail ← 0
    else if success ≥ c2 (r − E(n, r, k)) then
        rmax ← r
        r ← (rmax + rmin)/2
        success, fail ← 0
until rmax − rmin ≤ 1 and fail ≥ c1 E(n, r, k)
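The binary-search adjustment of r can be isolated as a small Python function. The names below are ours, and the expected failure count E(n, r, k) is assumed to be supplied by the caller; this is a sketch of the update rule described in the text, not the paper's code.

```python
def adjust_r(r, r_min, r_max, successes, failures, expected_failures,
             c1=3.0, c2=0.3):
    """Binary-search update for the estimate r used by RVErS.

    Too many consecutive failures means r was underestimated, so the
    lower bound rises; a long run of successes means selection was too
    easy, so r was overestimated and the upper bound falls.
    Returns (r, r_min, r_max, reset_counters).
    """
    if failures >= c1 * expected_failures:
        r_min = r                          # raise the estimate of r
        return (r_max + r_min) // 2, r_min, r_max, True
    if successes >= c2 * (r - expected_failures):
        r_max = r                          # lower the estimate of r
        return (r_max + r_min) // 2, r_min, r_max, True
    return r, r_min, r_max, False          # keep searching at current r
```

For example, with r = 50 and bounds [1, 100], a run of failures moves the estimate up toward 75, while a run of successes moves it down toward 25.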
Algorithm      Mean Updates   Mean Time (s)   Mean Calls   Mean Inputs
RVE (kopt)     5.5×10^9       359.9           81.1         10.0
rmax = 20      6.5×10^9       500.7           123.8        10.8
rmax = 40      8.0×10^9       603.8           151.3        10.2
rmax = 60      9.3×10^9       678.8           169.0        10.0
rmax = 80      10.0×10^9      694.7           172.3        10.0
rmax = 100     11.7×10^9      740.7           184.1        9.9
RVE (k = 1)    12.7×10^9      644.7           138.7        10.0

Table 4: Results of RVE and RVErS for several values of rmax. Mean calls refers to the number of calls made to the learning algorithm. Mean inputs refers to the number of inputs used by the final hypothesis.
performance of RVErS degrades slowly with respect to cost. The difference between RVErS with rmax = 100 and RVE with k = 1 is significant at the 95% confidence level (p = 0.049), as is the difference between RVErS with rmax = 20 and RVE with k = kopt (p = 0.0005). However, this slow degradation does not hold in terms of run time or number of calls to the learning algorithm. Here, only versions of RVErS with rmax = 20 or 40 show an improvement over RVE with k = 1.
The RVErS algorithm's termination criterion causes the sharp increase in the number of calls to the learning algorithm. Recall that as n approaches r the probability of a failed selection increases. This means that the number of allowable selection failures grows as the algorithm nears completion. Thus, the RVErS algorithm makes many calls to the learner using a small number of inputs n in an attempt to determine whether the search should be terminated. The search for r compounds the effect. If, at the end of the search, the irrelevant variables have been removed but rmin and rmax have not converged, then the algorithm must work through several failed sequences in order to terminate.
Figure 3 plots the number of variables selected compared to the average total number of weight updates for rmax = 20, 60 and 100. The error bars represent the standard deviation in the number of updates. Notice the jump in the number of updates required for the algorithm to reach completion (represented by number of inputs equal to ten) compared to the number of updates required to reach twenty remaining inputs. This pattern does not appear in the results of either version of the RVE algorithm shown in Figure 2. Traces of the RVErS algorithm support the conclusion that many calls to the learner are needed to reach termination even after the correct set of variables has been found.

The increase in run times follows directly from the increasing number of calls to the learner. The thermal perceptron algorithm carries a great deal of overhead not reflected by the number of updates. Since the algorithm executes for a fixed number of epochs, the run time of any call to the learner will contribute noticeably to the run time of RVErS, regardless of the number of selected variables. Contrast this behavior to that of a learner whose cost is based more firmly on the number of input variables, such as naive Bayes. Thus, even though RVErS always requires fewer weight updates than RVE with k = 1, the latter may still run faster.
[Figure 3: number of inputs (y-axis, 0 to 100) versus total updates ×10⁹ (x-axis, up to 15), with curves labeled rmax = 20, rmax = 60, and rmax = 100.]

Figure 3: A comparison between the number of inputs on which the thermal perceptrons are trained and the aggregate number of updates performed using the RVErS algorithm.
This result suggests that the termination criterion of the RVErS algorithm is flawed. The large number of calls to the learner at the end of the variable elimination process wastes a portion of the advantage generated earlier in the search. More importantly, the excess number of calls to the learner does not respect the very careful search trajectory computed by the cost function. Although our cost function for the learner M(L, n) does take the overhead of the thermal perceptron algorithm into account, there is no allowance for unnecessary calls to the learner. Future research with randomized variable elimination should therefore include a better termination criterion.
Data Set        Variables   Classes   Train Size   Test Size   Values of rmax
internet-ad     1558        2         3279         CV          1558, 750, 100
mult. feature   240         10        2000         CV          240, 150, 50
DNA             180         3         2000         1186        180, 100, 50
LED             150         10        2000         CV          150, 75, 25
opt-digits      64          10        3823         1797        64, 40, 25
soybean         35          19        683          CV          35, 25, 15
sick-euthyroid  25          2         3164         CV          25, 18, 10
monks-2-local   17          2         169          432         17, 10, 5

Table 5: Summary of data sets.
Chapelle, Pontil, Poggio, and Vapnik (2000). The goal here is to show that our comparatively liberal elimination method sacrifices little in terms of accuracy and gains much in terms of speed.
The LED problem used here was generated using code available at the repository, and includes a corruption of 10% of the class labels. Following Kohavi and John, the monks-2 data used here includes a local (one-of-n) encoding for each of the original six variables for a total of 17 Boolean variables. The original monks-2 problem contains no irrelevant variables, while the encoded version contains six irrelevant variables.
7.3 Methodology

For each data set and each of the two learning algorithms (C4.5 and naive Bayes), we ran four versions of the RVErS algorithm. Three versions of RVErS use different values of rmax in order to show how the choice of rmax affects performance. The fourth version is equivalent to RVE with k = 1 using a stopping criterion based on the number of consecutive failures (as in RVErS). This measures the performance of removing variables individually given that the number of relevant variables is completely unknown. For comparison, we also ran forward step-wise selection, backward step-wise elimination and a hybrid filtering algorithm. The filtering algorithm simply ranked the variables by gain-ratio, executed the learner using the first 1, 2, 3, ..., N variables, and selected the best.
The learning algorithms used here provide no performance guarantees, and may produce highly variable results depending on variable selections and available data. All seven selection algorithms therefore perform five-fold cross-validation using the training data to obtain an average hypothesis accuracy generated by the learner for each selection of variables. The methods proposed by Kohavi and John (1997) could be used to improve error estimates for cases in which the variance in hypothesis error rates is high. Their method should provide reliable estimates for adjusting the values of rmin and rmax regardless of learning algorithm.
Preliminary experiments indicated that the RVErS algorithm is more prone to becoming bogged down during the selection process than deterministic algorithms. We therefore set a small tolerance (0.002) as shown in Table 3, which allows the algorithm to keep only very good selections of variables while still preventing the selection process from stalling unnecessarily. We have not performed extensive tests to determine ideal tolerance values.
The final selections produced by the algorithms were evaluated in one of two ways. Domains for which no specific test set is provided were evaluated via ten-fold cross-validation. The remaining domains used the provided training and test sets. In the second case, we ran each of the four RVErS versions five times in order to smooth out any fluctuations due to the random nature of the algorithm.
7.4 Results

Tables 6–9 summarize the results of running the RVErS algorithm on the given data sets using naive Bayes and C4.5 for learning algorithms. In the tables, iters denotes the number of search iterations, evals denotes the number of subset evaluations performed, inputs denotes the size of the final set of selected variables, error rates include the standard deviation where applicable, and cost represents the total cost of the search with respect to the learner's cost metric. The first row in each block shows the performance of the learner prior to variable selection, while the remaining rows show the performance of the seven selection algorithms. Finally, "NA" indicates that the experiment was terminated due to excessive computational cost.
Data Set  Learner  Selection     Iters  Evals  Inputs  Percent Error  Time (s)  Search Cost
internet  Bayes    none          -      -      1558    3.0±0.9        0.5       4.61×10^6
                   rmax = 100    137    137    37.8    3.0±1.2        165       6.13×10^8
                   rmax = 750    536    536    9.2     3.2±1.2        790       3.99×10^9
                   rmax = 1558   845    845    17.5    2.9±0.8        1406      8.26×10^9
                   k = 1         1658   1658   8.8     3.0±1.2        2685      1.46×10^10
                   forward       20     30810  18.9    2.5±0.8        22417     5.32×10^9
                   backward      NA     NA     NA      NA             NA        NA
                   filter        1558   1558   837     3.1±0.9        2614      1.44×10^10
internet  C4.5     none          -      -      1558    3.0±0.8        48        2.04×10^5
                   rmax = 100    340    340    33.9    3.3±1.0        5386      3.30×10^7
                   rmax = 750    1233   1233   25.7    3.7±1.2        40656     2.61×10^8
                   rmax = 1558   1489   1489   20.0    3.2±1.3        78508     5.02×10^8
                   k = 1         1761   1761   20.6    3.3±1.0        91204     6.02×10^8
                   forward       19     28647  17.5    3.2±1.2        18388     1.95×10^7
                   backward      NA     NA     NA      NA             NA        NA
                   filter        1558   1558   640     3.1±1.0        77608     3.98×10^8
mult-ftr  Bayes    none          -      -      240     34.1±4.5       0.1       4.51×10^5
                   rmax = 50     53     53     18.8    18.3±2.0       13        3.09×10^7
                   rmax = 150    84     84     19.4    17.5±4.7       27        7.28×10^7
                   rmax = 240    112    112    19.9    17.5±2.2       41        1.13×10^8
                   k = 1         341    341    17.2    15.7±3.0       99        2.57×10^8
                   forward       20     4539   18.7    12.3±1.6       527       7.55×10^8
                   backward      186    27323  55.6    13.9±1.7       12097     3.52×10^10
                   filter        240    240    53.6    22.5±2.7       83        2.30×10^8
mult-ftr  C4.5     none          -      -      240     22.0±4.0       0.6       3.74×10^4
                   rmax = 50     306    306    22.3    22.1±2.0       241       1.13×10^7
                   rmax = 150    459    459    21.3    20.2±2.7       427       2.12×10^7
                   rmax = 240    474    474    22.0    22.1±3.5       519       2.66×10^7
                   k = 1         460    460    22.9    20.5±2.5       523       2.71×10^7
                   forward       26     5960   25.3    20.4±3.5       2004      5.06×10^7
                   backward      151    24722  90.8    20.4±3.1       51018     2.90×10^9
                   filter        240    240    140     21.2±2.7       354       1.93×10^7

Table 6: Variable selection results using the naive Bayes and C4.5 learning algorithms.
The performance of RVErS on the five largest data sets is encouraging. In most cases RVErS was comparable to the performance of step-wise selection with respect to generalization, while requiring substantially less computation. This effect is most clear in the mult-ftr data set, where forward selection with the C4.5 learner required nearly six CPU days to run (for ten-fold cross-validation) while the slowest RVErS version required just six hours. An exception to this trend occurs with the internet-ad data using C4.5. Here, the huge cost of running C4.5 with most of the variables included overwhelms RVErS's ability to eliminate variables quickly. Only the most aggressive run of the algorithm, with rmax = 100, manages to bypass the problem.

The internet-ad via C4.5 experiment highlights a second point. Notice how the forward selection algorithm runs faster than all but one version of RVErS. In this case, the cost and
Data Set  Learner  Selection    Iters  Evals  Inputs  Percent Error  Time (s)  Search Cost
DNA       Bayes    none         -      -      180     6.7            0.08      3.63×10^5
                   rmax = 50    359    359    24.2    4.7±0.7        52        1.52×10^8
                   rmax = 100   495    495    30.0    4.9±0.8        75        2.39×10^8
                   rmax = 180   519    519    25.6    5.0±0.5        84        2.89×10^8
                   k = 1        469    469    23.6    4.7±0.3        76        2.56×10^8
                   forward      19     3249   18.0    5.8            269       2.99×10^8
                   backward     34     5413   148     6.5            1399      7.16×10^9
                   filter       180    180    101     5.7            32        1.33×10^8
DNA       C4.5     none         -      -      180     9.7            0.5       1.95×10^4
                   rmax = 50    356    356    17.0    8.1±1.7        198       8.42×10^6
                   rmax = 100   384    384    16.2    7.1±1.5        222       9.07×10^6
                   rmax = 180   432    432    13.8    6.5±1.2        282       1.21×10^7
                   k = 1        374    374    14.4    6.5±1.1        274       1.18×10^7
                   forward      13     2262   12.0    5.9            418       2.33×10^6
                   backward     110    13735  72.0    8.7            18186     8.23×10^8
                   filter       180    180    17.0    7.6            163       7.10×10^6
LED       Bayes    none         -      -      150     30.3±3.0       0.09      2.75×10^5
                   rmax = 25    127    127    22.7    26.9±3.9       19        3.97×10^7
                   rmax = 75    293    293    17.4    26.0±3.3       50        1.09×10^8
                   rmax = 150   434    434    25.6    25.9±2.6       86        2.02×10^8
                   k = 1        423    423    23.7    27.0±2.1       85        2.04×10^8
                   forward      14     2006   13.0    26.6±2.9       141       1.54×10^8
                   backward     14     1870   138.0   30.1±2.6       667       1.95×10^9
                   filter       150    150    23.7    27.1±2.1       34        8.49×10^7
LED       C4.5     none         -      -      150     43.9±4.5       0.5       5.48×10^4
                   rmax = 25    85     85     51.1    42.0±3.0       89        1.01×10^7
                   rmax = 75    468    468    25.8    42.5±4.5       363       3.70×10^7
                   rmax = 150   541    541    25.2    40.8±5.7       440       4.48×10^7
                   k = 1        510    510    32.4    42.5±2.7       439       4.63×10^7
                   forward      9      1286   7.8     27.0±3.2       196       9.52×10^5
                   backward     61     7218   90.9    43.5±3.5       11481     1.33×10^9
                   filter       150    150    7.1     27.3±3.5       156       1.69×10^7

Table 7: Variable selection results using the naive Bayes and C4.5 learning algorithms.
time of running C4.5 many times on a small number of variables is less than that of running C4.5 few times on many variables. However, note that a slight change in the number of iterations needed by the forward algorithm would change the time and cost of the search dramatically. This is not the case for RVErS, since each iteration involves only a single evaluation instead of O(N) evaluations.

The number of subset evaluations made by RVErS is also important. Notice the growth in number of evaluations with respect to the total (initial) number of inputs. For aggressive versions of RVErS, growth is very slow, while more conservative versions, such as k = 1, grow approximately linearly. This suggests that the theoretical results discussed for RVE remain valid for RVErS. Additional tests using data with many hundreds or thousands of
Data Set    Learner  Selection   Iters  Evals  Inputs  Percent Error  Time (s)  Search Cost
opt-digits  Bayes    none        -      -      64      17.4           0.08      2.59×10^5
                     rmax = 25   111    111    14.2    15.7±1.5       15.9      4.05×10^7
                     rmax = 40   157    157    13.2    14.9±0.7       22.9      5.92×10^7
                     rmax = 64   162    162    14.4    14.7±1.0       24.9      6.66×10^7
                     k = 1       150    150    14.0    14.2±1.1       24.7      6.76×10^7
                     forward     17     952    16.0    14.1           93.8      1.91×10^8
                     backward    41     1781   25.0    13.5           423.0     1.39×10^9
                     filter      64     64     37.0    16.1           11.9      3.63×10^7
opt-digits  C4.5     none        -      -      64      43.2           0.7       2.15×10^4
                     rmax = 25   130    130    12.0    42.4±2.0       148       3.64×10^6
                     rmax = 40   158    158    10.8    42.2±0.3       181       4.42×10^6
                     rmax = 64   216    216    10.4    42.5±1.2       253       6.36×10^6
                     k = 1       140    140    11.4    42.1±1.1       189       5.33×10^6
                     forward     16     904    15.0    41.6           645       9.64×10^6
                     backward    28     1378   38.0    44.0           2842      9.55×10^7
                     filter      64     64     50.0    43.6           87        2.12×10^6
soybean     Bayes    none        -      -      35      7.8±2.4        0.02      2.40×10^4
                     rmax = 15   142    142    12.6    8.9±4.2        5.9       6.47×10^6
                     rmax = 25   135    135    11.9    10.5±5.8       5.8       6.26×10^6
                     rmax = 35   132    132    11.2    9.8±5.1        5.8       6.17×10^6
                     k = 1       88     88     12.3    9.6±5.0        4.6       4.67×10^6
                     forward     13     382    12.3    7.3±2.9        8.9       1.09×10^7
                     backward    19     472    18.0    7.9±4.6        37.5      3.63×10^7
                     filter      35     35     31.3    7.8±2.6        2.0       1.97×10^6
soybean     C4.5     none        -      -      35      8.6±4.0        0.04      1.21×10^3
                     rmax = 15   118    118    16.3    9.5±4.6        13.5      2.78×10^5
                     rmax = 25   158    158    14.7    10.1±4.1       18.6      1.90×10^5
                     rmax = 35   139    139    16.3    9.1±3.7        17.3      3.86×10^5
                     k = 1       117    117    16.1    9.3±3.5        14.9      3.52×10^5
                     forward     16     435    14.8    9.1±4.0        33.7      3.22×10^5
                     backward    18     455    19.1    10.4±4.4       69.0      1.75×10^6
                     filter      35     35     30.8    8.5±3.7        3.7       6.06×10^4

Table 8: Variable selection results using the naive Bayes and C4.5 learning algorithms.
variables would be instructive, but may not be feasible with respect to the deterministic search algorithms.

RVErS does not achieve the same economy of subset evaluations on the three smaller problems as on the larger problems. This is not surprising, since the ratio of irrelevant variables to total variables is much smaller, requiring RVErS to proceed more cautiously. In these cases, the value of rmax has only a minor effect on performance, as RVErS is unable to remove more than two or three variables in any given step.

One problem evidenced by both large and small data sets is that there appears to be no clear choice of a best value for rmax. Conservative versions of RVErS tend to produce lower error rates, but there are exceptions. In some cases, rmax has very little effect on error. However, in most cases, small values of rmax have a distinct positive effect on run time.
Data Set   Learner  Selection   Iters  Evals  Inputs  Percent Error  Time (s)  Search Cost
euthyroid  Bayes    none        -      -      25      6.2±1.3        0.02      7.54×10^4
                    rmax = 10   30     30     2.0     4.6±1.5        2.1       2.65×10^6
                    rmax = 18   39     39     2.0     4.8±1.2        1.3       3.73×10^6
                    rmax = 25   46     46     2.3     5.1±1.4        1.6       4.32×10^6
                    k = 1       35     35     1.7     5.0±1.2        1.5       4.86×10^6
                    forward     5      118    4.2     4.6±1.1        3.4       6.45×10^6
                    backward    16     263    11.3    4.2±1.3        14.4      5.91×10^7
                    filter      25     25     4.7     4.2±0.6        1.4       4.16×10^6
euthyroid  C4.5     none        -      -      25      2.7±1.0        0.2       1.00×10^3
                    rmax = 10   49     49     2.9     2.4±0.8        21.0      2.98×10^4
                    rmax = 18   63     63     3.3     2.2±0.7        27.6      4.58×10^4
                    rmax = 25   55     55     2.7     2.5±0.8        25.3      4.92×10^4
                    k = 1       54     54     3.8     2.3±0.9        29.4      6.39×10^4
                    forward     7      151    5.9     2.4±0.7        51.8      3.89×10^4
                    backward    16     269    11.0    2.5±0.9        200.0     6.73×10^5
                    filter      25     25     15.2    2.7±1.1        15.4      3.60×10^4
monks-2    Bayes    none        -      -      17      39.4           0.01      3.11×10^3
                    rmax = 5    25     25     2.6     36.1±3.2       0.02      1.10×10^5
                    rmax = 10   54     54     4.0     37.2±2.3       0.05      2.50×10^5
                    rmax = 17   74     74     6.0     37.4±3.1       0.08      4.93×10^5
                    k = 1       41     41     6.0     36.8±3.1       0.04      2.89×10^5
                    forward     2      33     1.0     32.9           0.02      6.70×10^4
                    backward    8      99     11.0    38.4           0.13      9.93×10^5
                    filter      17     17     2.0     40.3           0.01      1.21×10^5
monks-2    C4.5     none        -      -      17      23.6           0.03      5.14×10^2
                    rmax = 5    36     36     8.2     16.7±10.9      0.8       2.34×10^4
                    rmax = 10   84     84     6.2     4.6±0.4        1.9       3.74×10^4
                    rmax = 17   79     79     6.4     6.5±4.7        1.8       4.63×10^4
                    k = 1       55     55     6.2     4.4±0.0        1.4       4.13×10^4
                    forward     2      33     1.0     32.9           0.6       3.95×10^2
                    backward    13     139    6.0     4.4            3.9       1.61×10^5
                    filter      17     17     13.0    35.6           0.4       1.51×10^4

Table 9: Variable selection results using the naive Bayes and C4.5 learning algorithms.
The results suggest two other somewhat surprising conclusions. One is that backward elimination does not appear to have the commonly assumed positive effect on generalization. Step-wise forward selection tends to outperform step-wise backward elimination, although randomization often reduces this effect. The second conclusion is that the hybrid filter algorithm performs well in some cases, but worse than RVErS and step-wise selection in most cases. Notice also that for problems with many variables, RVErS runs as fast or faster than the filter. Additional experiments along these lines would be instructive.

Overfitting is sometimes a problem with greedy variable selection algorithms. Figures 4 and 5 show both the test and inner (training) cross-validation error rates for the selection algorithms on naive Bayes and C4.5 respectively. Solid lines indicate test error, while dashed lines indicate the inner cross-validation error. Notice that the test error is not always
[Figure 4: test error (solid) and inner cross-validation error (dashed) versus search iterations, in six panels: RVErS rmax = 50, RVErS rmax = 100, RVErS rmax = 180, RVErS k = 1, forward selection, and backward elimination, using naive Bayes.]
[Figure 5: test error (solid) and inner cross-validation error (dashed) versus search iterations, in six panels: RVErS rmax = 50, RVErS rmax = 100, RVErS rmax = 180, RVErS k = 1, forward selection, and backward elimination, using C4.5.]
minimized with the final selections produced by RVErS. The graphs show that RVErS does tend to overfit naive Bayes, but not C4.5 (or at least to a lesser extent). Trace data from the other data sets agree with this conclusion.

There are at least two possible explanations for overfitting by RVErS. One is that the tolerance level either causes the algorithm to continue eliminating variables when it should stop, or allows elimination of relevant variables. In either case, a better adjusted tolerance level should improve performance. The monks-2 data set provides an example. In this case, if the tolerance is set to zero, RVErS reliably finds variable subsets that produce low-error hypotheses with C4.5.

A second explanation is that the stopping criterion, which becomes more difficult to satisfy as the algorithm progresses, causes the elimination process to become overzealous. In this case the solution may be to augment the given stopping criterion with a hold-out data set (in addition to the validation set). Here the algorithm monitors performance in addition to counting consecutive failures, returning the best selection, rather than simply the last. Combining this overfitting result with the above performance results suggests that RVErS is capable of performing quite well with respect to both generalization and speed.
8. Discussion

The speed of randomized variable elimination stems from two aspects of the algorithm. One is the use of large steps in moving through the search space of variable sets. As the number of irrelevant variables grows, and the probability of selecting a relevant variable at random shrinks, RVE attempts to take larger steps toward its goal of identifying all of the irrelevant variables. In the face of many irrelevant variables, this is a much easier task than attempting to identify the relevant variables.
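The relationship between the fraction of irrelevant variables and the ease of random selection can be illustrated in a few lines of Python. The expression below is the standard hypergeometric probability for drawing without replacement, given here for illustration rather than quoted from this section:

```python
from math import comb

def p_all_irrelevant(n, r, k):
    """Probability that k variables drawn uniformly without
    replacement from n total, of which r are relevant, are all
    irrelevant: C(n - r, k) / C(n, k)."""
    return comb(n - r, k) / comb(n, k)
```

For instance, with r = 10 relevant variables a single draw misses every relevant variable 90% of the time when n = 100, and 99% of the time when n = 1000; the more irrelevant variables there are, the safer a large random step becomes.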
The second source of speed in RVE is the approach of removing variables immediately, instead of finding the best variable (or set) to remove. This is much less conservative than the approach taken by step-wise algorithms, and accounts for much of the benefit of RVE. In practice, the full benefit of removing multiple variables simultaneously may only be beginning to materialize in the data sets used here. However, we expect that as domains scale up, multiple selections will become increasingly important. One example of this occurs in the STL algorithm (Utgoff and Stracuzzi, 2002), which learns many concepts over a period of time. There, the number of available input variables grows as more concepts are learned by the system.
Consider briefly the cost of forward selection wrapper algorithms. Greedy step-wise search is bounded by O(rNM(L, r)) for forward selection and O(N(N − r)M(L, N)) for backward elimination, provided it does not backtrack or remove (or add) previously added (or removed) variables. The bound on the backward approach reflects both the larger number of steps required to remove the irrelevant variables and the larger number of variables used at each call to the learner. The cost of training each hypothesis is small in the forward greedy approach compared to RVE, since the number of inputs to any given hypothesis is much smaller (bounded roughly by r). However, the number of calls to the learning algorithm is polynomial in N. As the number of irrelevant variables increases, even a forward greedy approach to variable selection quickly becomes unmanageable.
The cost of a best-first search using compound operators (Kohavi and John, 1997) is somewhat harder to analyze. Their approach combines the two best operators (e.g. add variable or remove variable) and then checks whether the result is an improvement. If so, the resulting operator is combined with the next best operator and tested, continuing until there is no improvement. Theoretically this type of search could find a solution using approximately 2r forward evaluations or 2(N − r) backward subset evaluations. However, this would require the algorithm to make the correct choice at every step. The experimental results (Kohavi and John, 1997) suggest that in practice the algorithm requires many more subset evaluations than this minimum.
Compare the above bounds on forward and backward greedy search to that of RVE given a fixed k = 1, which is O(rNM(L, N)). Notice that the number of calls to the learning algorithm is the same for RVE with fixed k and a greedy forward search (the cost of learning is different, however). The empirical results support the conclusion that the two algorithms produce similar cost, but also show that RVE with k = 1 requires less CPU time. The source of this additional economy is unclear, although it may be related to various overhead costs associated with the learning algorithms. RVE requires many fewer total learner executions, thereby reducing overhead.
In practice, the k = 1 version of RVErS often makes fewer than rN calls to the learning algorithm. This follows from the very high probability of a successful selection of an irrelevant variable at each step. In cases when N is much larger than r, the algorithm with k = 1 makes roughly N calls to the learner as shown in Tables 6 and 7. Additional economy may also be possible when k is fixed at one. Each variable should only need to be tested once, allowing RVErS to make exactly N calls to the learner. Further experiments are needed to confirm this intuition.
Although the RVE algorithm using a fixed k = 1 is significantly more expensive than the optimal RVE or RVErS using a good guess for rmax, experiments and analysis show that this simple algorithm is generally faster than the deterministic forward or backward approaches, provided that there are enough irrelevant variables in the domain. As the ratio r/N decreases, and the probability of selecting an irrelevant variable at random increases, the benefit of a randomized approach improves. Thus, even when no information about the number of relevant variables is available, a randomized, backward approach to variable selection may be beneficial.
A disadvantage to randomized variable selection is that there is no clear way to recover from poor choices. Step-wise selection algorithms sometimes consider both adding and removing variables at each step, so that no variable is ever permanently selected or eliminated. A hybrid version of RVErS which considers adding a single variable each time a set of variables is eliminated is possible, but this would ultimately negate much of the algorithm's computational benefit.
Step-wise selection algorithms are sometimes parallelized in order to speed the selection process. This is due in large part to the very high cost of step-wise selection. RVE mitigates this problem to a point, but there is no obvious way to parallelize a randomized selection algorithm. Parallelization could be used to improve generalization performance by allowing the algorithm to evaluate several subsets simultaneously and then choose the best.
9. Future Work

There are at least three possible directions for future work with RVE. The first is an improved method for choosing k when r is unknown. We have presented an algorithm based on a binary search, but RVErS still wastes a great deal of time deciding when to terminate the search, and can quickly degenerate into a one-at-a-time removal strategy if bad decisions are made early in the search. Notice, however, that this worst-case performance is still better than stepwise backward elimination, and comparable to stepwise forward selection, both popular algorithms.
A second direction for future work involves further study of the effect of testing very few of the possible successors to the current search node. Testing all possible successors is the source of the high cost of most wrapper methods. If a sparse search, such as that used by RVE, does not sacrifice much quality in general, then other fast wrapper algorithms may be possible.
A third possible direction involves biasing the random selections at each step. If a set of k variables fails to maintain evaluation performance, then at least one of the k must have been relevant to the learning problem. Thus, variables included in a failed selection may be viewed as more likely to be relevant. This "relevance likelihood" can be tracked throughout the elimination process and used to bias selections at each step.
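One possible form this biasing could take is sketched below. The scheme and all names are our own illustration of the idea, not a method evaluated in the paper: each variable's appearances in failed selections are counted, and the random draw is weighted toward variables with fewer recorded failures.

```python
import random

def biased_selection(candidates, k, failure_counts):
    """Draw k variables without replacement, down-weighting those
    that have appeared in failed selections (and are therefore more
    likely to be relevant). failure_counts maps a variable to the
    number of failed selections it has appeared in."""
    pool = list(candidates)
    weights = [1.0 / (1 + failure_counts.get(v, 0)) for v in pool]
    chosen = set()
    for _ in range(k):
        v = random.choices(pool, weights=weights, k=1)[0]
        i = pool.index(v)
        pool.pop(i)
        weights.pop(i)               # remove so the draw is without replacement
        chosen.add(v)
    return chosen
```

Variables never seen in a failed set keep weight 1, so early in the search the scheme reduces to a uniform draw, matching RVE's original behavior.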
10. Conclusion

The randomized variable elimination algorithm uses a two-step process to remove irrelevant input variables. First, a sequence of values for k, the number of input variables to remove at each step, is computed such that the cost of removing all N − r irrelevant variables is minimized. The algorithm then removes the irrelevant variables by randomly selecting inputs for removal according to the computed schedule. Each step is verified by generating and testing a hypothesis to ensure that the new hypothesis is at least as good as the existing hypothesis. A randomized approach to variable elimination that simultaneously removes multiple inputs produces a factor N speed-up over approaches that remove inputs individually, provided that the number r of relevant variables is known in advance.
When the number of relevant variables is not known, a search for r may be conducted in parallel with the search for irrelevant variables. Although this approach wastes some of the benefits generated by the theoretical algorithm, a reasonable upper bound on the number of relevant variables still produces good performance. When even this weaker condition cannot be satisfied, a randomized approach may still outperform the conventional deterministic wrapper approaches, provided that the number of relevant variables is small compared to the total number of variables. A randomized approach to variable selection is therefore applicable whenever the target domain is believed to have many irrelevant variables.
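One simple way to picture such a parallel search is an interval of candidate values for r that narrows as elimination steps succeed or fail. The update rule below is a hypothetical illustration constructed for this sketch, not the paper's actual method:

```python
def update_r_bounds(r_lo, r_hi, step_succeeded):
    """Narrow a candidate interval [r_lo, r_hi] for r, the unknown
    number of relevant variables (illustrative rule only).

    A successful elimination step is weak evidence that the current
    estimate of r is not too low; a failed step suggests more
    variables are relevant than currently assumed."""
    r_est = (r_lo + r_hi) // 2
    if step_succeeded:
        return r_lo, r_est                 # r is plausibly <= the estimate
    return min(r_est + 1, r_hi), r_hi      # raise the lower bound

# Starting from the interval [0, 32]:
# update_r_bounds(0, 32, True)  -> (0, 16)
# update_r_bounds(0, 32, False) -> (17, 32)
```

Each update halves the remaining interval, so the estimate of r converges after logarithmically many informative steps, at the cost of some of the theoretical speed-up.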
Finally, we conclude that an explicit search through the space of variable subsets is not necessary to achieve good performance from a wrapper algorithm. Randomized variable elimination provides competitive performance without incurring the high cost of expanding and evaluating all successors of a search node. As a result, randomized variable elimination scales well beyond current wrapper algorithms for variable selection.
Acknowledgments
The authors thank Bill Hesse for his advice concerning the analysis of RVE. This material is based upon work supported by the National Science Foundation under Grant No. 0097218. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
D. W. Aha and R. L. Bankert. Feature selection for case-based classification of cloud types: An empirical comparison. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pages 106–112, Seattle, WA, 1994. AAAI Press.
H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.
C. L. Blake and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, 1998.
C. Cardie. Using decision trees to improve case-based learning. In Machine Learning: Proceedings of the Tenth International Conference, Amherst, MA, 1993. Morgan Kaufmann.
R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.
K. J. Cherkauer and J. W. Shavlik. Growing simpler decision trees to facilitate knowledge discovery. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996.
W. W. Cooley and P. R. Lohnes. Multivariate data analysis. Wiley, New York, 1971.
P. A. Devijver and J. Kittler. Pattern recognition: A statistical approach. Prentice Hall/International, 1982.
P. Domingos. Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, 11:227–253, 1997.
G. H. Dunteman. Principal components analysis. Sage Publications, Inc., Newbury Park, CA, 1989.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Technical Report TR-220, Stanford University, Department of Statistics, 2003.
S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, 2:524–532, 1990.
W. Finnoff, F. Hergert, and H. G. Zimmermann. Improving model selection by nonconvergent methods. Neural Networks, 6:771–783, 1993.
M. Frean. A "thermal" perceptron learning rule. Neural Computation, 4(6):946–957, 1992.
J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881–889, 1974.
S. I. Gallant. Perceptron-based learning. IEEE Transactions on Neural Networks, 1(2):179–191, 1990.
D. Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA, 1989.
M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.
B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.
I. Inza, P. Larranaga, R. Etxeberria, and B. Sierra. Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 123(1-2):157–184, 2000.
G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121–129, New Brunswick, NJ, 1994. Morgan Kaufmann.
R. Kerber. ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 123–128, San Jose, CA, 1992. MIT Press.
R. King, C. Feng, and A. Sutherland. StatLog: Comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9(3):259–287, 1995.
K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.
J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1–64, 1997.
R. Kohavi. Wrappers for performance enhancement and oblivious decision graphs. PhD thesis, Department of Computer Science, Stanford University, Stanford, CA, 1995.
R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
D. Koller and M. Sahami. Toward optimal feature selection. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 284–292. Morgan Kaufmann, 1996.
M. Kubat, D. Flotzinger, and G. Pfurtscheller. Discovering patterns in EEG signals: Comparative study of a few methods. In Proceedings of the European Conference on Machine Learning, pages 367–371, 1993.
P. Langley and S. Sage. Oblivious decision trees and abstract cases. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, Seattle, WA, 1994. AAAI Press.
Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, pages 598–605, San Mateo, CA, 1990. Morgan Kaufmann.
N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1988.
H. Liu and R. Setiono. A probabilistic approach to feature selection: A filter solution. In Machine Learning: Proceedings of the Fourteenth International Conference. Morgan Kaufmann, 1996.
H. Liu and R. Setiono. Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4):642–645, 1997.
O. Maron and A. W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann, 1994.
M. Mitchell. An introduction to genetic algorithms. MIT Press, Cambridge, MA, 1996.
T. M. Mitchell. Machine learning. McGraw-Hill, 1997.
J. Moody and J. Utans. Architecture selection for neural networks: Application to corporate bond rating prediction. In A. N. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
J. R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
D. E. Rumelhart and J. L. McClelland. Parallel distributed processing. MIT Press, Cambridge, MA, 1986. 2 volumes.
L. Thurstone. Multiple factor analysis. Psychological Review, 38:406–427, 1931.
P. E. Utgoff and D. J. Stracuzzi. Many-layered learning. Neural Computation, 14(10):2497–2529, 2002.
H. Vafaie and K. De Jong. Genetic algorithms as a tool for restructuring feature space representations. In Proceedings of the International Conference on Tools with AI. IEEE Computer Society Press, 1995.