
Journal of Machine Learning Research 5 (2004) 1331-1362

Submitted 11/02; Published 10/04

Randomized Variable Elimination


David J. Stracuzzi  stracudj@cs.umass.edu
Paul E. Utgoff  utgoff@cs.umass.edu

Department of Computer Science
University of Massachusetts at Amherst
140 Governors Drive
Amherst, MA 01003

Editor: Haym Hirsh

Abstract

Variable selection, the process of identifying input variables that are relevant to a particular learning problem, has received much attention in the learning community. Methods that employ a learning algorithm as a part of the selection process (wrappers) have been shown to outperform methods that select variables independently from the learning algorithm (filters), but only at great computational expense. We present a randomized wrapper algorithm whose computational requirements are within a constant factor of simply learning in the presence of all input variables, provided that the number of relevant variables is small and known in advance. We then show how to remove the latter assumption, and demonstrate performance on several problems.
1. Introduction

When learning in a supervised environment, a learning algorithm is typically presented with a set of N-dimensional data points, each with its associated target output. The learning algorithm then outputs a hypothesis describing the function underlying the data. In practice, the set of N input variables is carefully selected by hand in order to improve the performance of the learning algorithm in terms of both learning speed and hypothesis accuracy.
In some cases there may be a large number of inputs available to the learning algorithm, few of which are relevant to the target function, with no opportunity for human intervention. For example, feature detectors may generate a large number of features in a pattern recognition task. A second possibility is that the learning algorithm itself may generate a large number of new concepts (or functions) in terms of existing concepts. Valiant (1984), Fahlman and Lebiere (1990), and Kivinen and Warmuth (1997) all discuss situations in which a potentially large number of features are created during the learning process. In these situations, an automatic approach to variable selection is required.
One approach to variable selection that has produced good results is the wrapper method (John et al., 1994). Here, a search is performed in the space of variable subsets, with the performance of a specific learning algorithm based on such a subset serving as an evaluation function. Using the actual generalization performance of the learning algorithm as an evaluation metric allows this approach to search for the most predictive set of input
©2004 David J. Stracuzzi and Paul E. Utgoff.


variables with respect to the learner. However, executing the learning algorithm for each selection of variables during the search ultimately renders the approach intractable in the presence of many irrelevant variables.
In spite of the cost, variable selection can play an important role in learning. Irrelevant variables can often degrade the performance of a learning algorithm, particularly when data are limited. The main computational cost associated with the wrapper method is usually that of executing the learning algorithm. The learner must produce a hypothesis for each subset of the input variables. Even greedy selection methods (Caruana and Freitag, 1994) that ignore large areas of the search space can produce a large number of candidate variable sets in the presence of many irrelevant variables.
Randomized variable elimination avoids the cost of evaluating many variable sets by taking large steps through the space of possible input sets. The number of variables eliminated in a single step depends on the number of currently selected variables. We present a cost function whose purpose is to strike a balance between the probability of failing to select successfully a set of irrelevant variables and the cost of running the learning algorithm many times. We use a form of backward elimination approach to simplify the detection of relevant variables. Removal of any relevant variable should immediately cause the learner's performance to degrade. Backward elimination also simplifies the selection process when irrelevant variables are much more common than relevant variables, as we assume here.
Analysis of our cost function shows that the cost of removing all irrelevant variables is dominated by the cost of simply learning with all N variables. The total cost is therefore within a constant factor of the cost of simply learning the target function based on all N input variables, provided that the cost of learning grows at least polynomially in N. The bound on the complexity of our algorithm is based on the complexity of the learning algorithm being used. If the given learning algorithm executes in time O(N²), then removing the N − r irrelevant variables via randomized variable elimination also executes in time O(N²). This is a substantial improvement compared to the factor N or more increase experienced in removing inputs one at a time.
2. Variable Selection

The specific problem of variable selection is the following: given a large set of input variables and a target concept or function, produce a subset of the original input variables that best predicts the target concept or function when combined into a hypothesis by a learning algorithm. The term "predict best" may be defined in a variety of ways, depending on the specific application and selection algorithm. Ideally the produced subset should be as small as possible to reduce training costs and help prevent overfitting.
From a theoretical viewpoint, variable selection should not be necessary. For example, the predictive power of Bayes rule increases monotonically with the number of variables. More variables should always result in more discriminating power, and removing variables should only hurt. However, optimal applications of Bayes rule are intractable for all but the smallest problems. Many machine learning algorithms perform sub-optimal operations and do not conform to the strict conditions of Bayes rule, resulting in the potential for a performance decline in the face of unnecessary inputs. More importantly, learning algorithms usually have access to a limited number of examples. Unrelated inputs require additional
capacity in the learner, but do not bring new information in exchange. Variable selection is thus a necessary aspect of inductive learning.
A variety of approaches to variable selection have been devised. Most methods can be placed into one of two categories: filter methods or wrapper methods. Filter approaches perform variable selection independently of the learning algorithm, while wrappers make learner-dependent selections. A third group of special purpose methods performs feature selection in the context of neural networks, known as parameter pruning. These methods cannot directly perform variable selection for arbitrary learning algorithms; they are approaches to removing irrelevant inputs from learning elements.
Many variable selection algorithms (although not all) perform some form of search in the space of variable subsets as part of their operation. A forward selection algorithm begins with the empty set and searches for variables to add. A backward elimination algorithm begins with the set of all variables and searches for variables to remove. Optionally, forward algorithms may occasionally choose to remove variables, and backward algorithms may choose to add variables. This allows the search to recover from previous poor selections. The advantage of forward selection is that, in the presence of many irrelevant variables, the size of the subsets will remain relatively small, helping to speed evaluation. The advantage of backward elimination is that recognizing irrelevant variables is easier. Removing a relevant variable from an otherwise complete set should cause a decline in the evaluation, while adding a relevant variable to an incomplete set may have little immediate impact.
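The greedy backward elimination scheme described above can be sketched as follows. This is an illustrative sketch, not an algorithm from the paper: `learn_and_score` is a hypothetical evaluation function standing in for training the learner on a variable subset and returning its estimated accuracy.

```python
def backward_eliminate(variables, learn_and_score, tolerance=0.0):
    """Greedy backward elimination: repeatedly drop the single variable
    whose removal hurts the evaluation least, stopping when every
    removal lowers the score by more than `tolerance`."""
    current = set(variables)
    best_score = learn_and_score(current)
    while len(current) > 1:
        # Evaluate each one-variable removal; keep the best candidate.
        candidates = [(learn_and_score(current - {v}), v) for v in current]
        score, victim = max(candidates)
        if score < best_score - tolerance:
            break  # every removal degrades performance: stop
        current.remove(victim)
        best_score = max(best_score, score)
    return current
```

Note that each pass evaluates every remaining variable, which is exactly the per-step cost that randomized variable elimination later avoids.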

2.1 Filters
Filter methods use statistical measures to evaluate the quality of the variable subsets. The goal is to find a set of variables that is best with respect to the specific quality measure. Determining which variables to include may either be done via an explicit search in the space of variable subsets, or by numerically weighting the variables individually and then selecting those with the largest weight. Filter methods often have the advantage of speed. The statistical measures used to evaluate variables typically require very little computation compared to the cost of running a learning algorithm many times. The disadvantage is that variables are evaluated independently, not in the context of the learning problem.
Early filtering algorithms include FOCUS (Almuallim and Dietterich, 1991) and Relief (Kira and Rendell, 1992). FOCUS searches for a smallest set of variables that can completely discriminate between target classes, while Relief ranks variables according to a distance metric. Relief selects training instances at random when computing distance values. Note that this is not related to our approach of selecting variables at random.
Decision trees have also been employed to select input variables by first inducing a tree, and then selecting only those variables tested by decision nodes (Cardie, 1993; Kubat et al., 1993). In another vein, Koller and Sahami (1996) discuss a variable selection algorithm based on cross entropy and information theory.
Methods from statistics also provide a basis for a variety of variable filtering algorithms. Correlation-based feature selection (CFS) (Hall, 1999) attempts to find a set of variables that are each highly correlated with the target function, but not with each other. The ChiMerge (Kerber, 1992) and Chi2 algorithms (Liu and Setiono, 1997) remove both irrelevant and redundant variables using a χ² test to merge adjacent intervals of ordinal variables.
Other methods from statistics solve problems closely related to variable selection. For example, principal component analysis (see Dunteman, 1989) is a method for transforming the observed variables into a smaller number of dimensions, as opposed to removing irrelevant or redundant variables. Projection pursuit (Friedman and Tukey, 1974) and factor analysis (Thurstone, 1931) (see Cooley and Lohnes, 1971, for a detailed presentation) are used both to reduce dimensionality and to detect structure in relationships among variables.
Discussion of filtering methods for variable selection also arises in the pattern recognition literature. For example, Devijver and Kittler (1982) discuss the use of a variety of linear and non-linear distance measures and separability measures such as entropy. They also discuss several search algorithms, such as branch and bound and plus l-take away r. Branch and bound is an optimal search technique that relies on a careful ordering of the search space to avoid an exhaustive search. Plus l-take away r is more akin to the standard forward and backward search. At each step, l new variables are selected for inclusion in the current set and r existing variables are removed.

2.2 Wrappers
Wrapper methods attempt to tailor the selection of variables to the strengths and weaknesses of specific learning algorithms by using the performance of the learner to evaluate subset quality. Each candidate variable set is evaluated by executing the learning algorithm given the selected variables and then testing the accuracy of the resulting hypotheses. This approach has the advantage of using the actual hypothesis accuracy as a measure of subset quality. The problem is that the cost of repeatedly executing the learning algorithm can quickly become prohibitive. Nevertheless, wrapper methods do tend to outperform filter methods. This is not surprising given that wrappers evaluate variables in the context of the learning problem, rather than independently.
2.2.1 Algorithms

John, Kohavi, and Pfleger (1994) appear to have coined the term "wrapper" while researching the method in conjunction with a greedy search algorithm, although the technique has a longer history (Devijver and Kittler, 1982). Caruana and Freitag (1994) also experimented with greedy search methods for variable selection. They found that allowing the search to either add variables or remove them at each step of the search improved over simple forward and backward searches. Aha and Bankert (1994) use a backward elimination beam search in conjunction with the IB1 learner, but found no evidence to prefer this approach to forward selection. OBLIVION (Langley and Sage, 1994) selects variables for the nearest neighbor learning algorithm. The algorithm uses a backward elimination approach with a greedy search, terminating when the nearest neighbor accuracy begins to decline.
Subsequent work by Kohavi and John (1997) used forward and backward best-first search in the space of variable subsets. Search operators generally include adding or removing a single variable from the current set. This approach is capable of producing a minimal set of input variables, but the cost grows exponentially in the face of many irrelevant variables. Compound operators generate nodes deep in the search tree early in the search by combining the best children of a given node. However, the cost of running the best-first search ultimately remains prohibitive in the presence of many irrelevant variables.
Hoeffding races (Maron and Moore, 1994) take a different approach. All possible models (selections) are evaluated via leave-one-out cross validation. For each of the N evaluations, an error confidence interval is established for each model. Models whose error lower bound exceeds the upper bound of the best model are discarded. The result is a set of models whose errors are not significantly different.
Several algorithms for constructing regression models are also forms of wrapper methods. For example, least angle regression (Efron et al., 2003), which generalizes and improves upon several forward selection regression algorithms, adds variables to the model incrementally.
Genetic algorithms have also been applied as a search mechanism for variable selection. Vafaie and De Jong (1995) describe using a genetic algorithm to perform variable selection. They used a straightforward representation in which individual chromosomes were bit-strings, with each bit marking the presence or absence of a specific variable. Individuals were evaluated by training and then testing the learning algorithm. In a similar vein, SET-Gen (Cherkauer and Shavlik, 1996) used a fitness (evaluation) function that included both the accuracy of the induced model and the comprehensibility of the model. The learning model used in their experiments was a decision tree, and comprehensibility was defined as a combination of tree size and number of features used. The FSS-EBNA algorithm (Inza et al., 2000) used Bayesian networks to mate individuals in a GA-based approach to variable selection.
The relevance-in-context (RC) algorithm (Domingos, 1997) is based on the idea that, for instance-based (lazy) learners, some features may only be relevant in particular areas of the instance space. Clusters of training examples are formed by finding examples of the same class with nearly equivalent feature vectors. The features along which the examples differ are removed and the accuracy of the entire model is determined. If the accuracy declined, the features are restored and the failed examples are removed from consideration. The algorithm continues until there are no more examples to consider. Results showed that RC outperformed other wrapper methods with respect to a 1-NN learner.
2.2.2 Learner Selections

Many learning algorithms already contain some (possibly indirect) form of variable selection, such as pruning in decision trees. This raises the question of whether the variable selections made by the learner should be used by the wrapper. Such an approach would almost certainly run faster than methods that rely only on the wrapper to make variable selections. The wrapper selects variables for the learner, and then executes the learner. If the resulting hypothesis is an improvement, then the wrapper further removes all variables not used in the hypothesis before continuing on with the next round of selections.
This approach assumes the learner is capable of making beneficial variable selections. If this were true, then both filter and wrapper methods would be largely irrelevant. Even the most sophisticated learning algorithms may perform poorly in the presence of highly correlated, redundant or irrelevant variables. For example, John, Kohavi, and Pfleger (1994) and Kohavi (1995) both demonstrate how C4.5 (Quinlan, 1993) can be tricked into making bad decisions at the root. Variables highly correlated with the target value, yet ultimately useless in terms of making beneficial data partitions, are selected near the root, leading to
unnecessarily large trees. Moreover, these bad decisions cannot be corrected by pruning. Only variable selection performed outside the context of the learning algorithm can recognize these types of correlated, irrelevant variables.
2.2.3 Estimating Performance

One question that any wrapper method must consider is how to obtain a good estimate of the accuracy of the learner's hypothesis. Both the amount and quality of data available to the learner affect the testing accuracy. Kohavi and John (1997) suggest using multiple runs of five-fold cross-validation to obtain an error estimate. They determine the number of cross-validation runs by continuing until the standard deviation of the accuracy estimate is less than 1%. This has the nice property of (usually) requiring fewer runs for large data sets. However, in general, cross-validation is an expensive procedure, requiring the learner to produce several hypotheses for each selection of variables.
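Kohavi and John's stopping rule can be sketched as follows. The sketch assumes a caller-supplied `run_cv` callable that performs one complete five-fold cross-validation run and returns a mean accuracy; the cap on the number of runs is an added safeguard, not part of their rule.

```python
import statistics

def estimate_accuracy(run_cv, max_runs=20, threshold=0.01):
    """Repeat cross-validation runs until the standard deviation of
    the accuracy estimates drops below `threshold` (the 1% rule),
    then return the mean of the collected estimates."""
    estimates = []
    while True:
        estimates.append(run_cv())
        if len(estimates) >= 2 and statistics.stdev(estimates) < threshold:
            break
        if len(estimates) >= max_runs:
            break  # guard against estimates that never stabilize
    return statistics.mean(estimates)
```

As noted in the text, each call to `run_cv` trains the learner several times, which is what makes this estimate expensive inside a wrapper loop.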

2.3 Model Specific Methods


Many learning algorithms have built-in variable (or parameter) selection algorithms which are used to improve generalization. As noted above, decision tree pruning is one example of built-in variable selection. Connectionist algorithms provide several other examples, known as parameter pruning. As in the more general variable selection problem, extra weights (parameters) in a network can degrade the performance of the network on unseen test instances, and increase the cost of evaluating the learned model. Parameter pruning algorithms often suffer the same disadvantages as tree pruning. Poor choices made early in the learning process cannot usually be undone.
One method for dealing with unnecessary network parameters is weight decay (Werbos, 1988). Weights are constantly pushed toward zero by a small multiplicative factor in the update rule. Only the parameters relevant to the problem receive sufficiently large weight updates to remain significant. Methods for parameter pruning include the optimal brain damage (OBD) (LeCun et al., 1990) and optimal brain surgeon (OBS) (Hassibi and Stork, 1993) algorithms. Both rely on the second derivative to determine the importance of connection weights. Sensitivity-based pruning (Moody and Utans, 1995) evaluates the effect of removing a network input by replacing the input by its mean over all training points. The autoprune algorithm (Finnoff et al., 1993) defines an importance metric for weights based on the assumption that irrelevant weights will become zero. Weights with a low metric value are considered unimportant and are removed from the network.
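The weight decay idea can be illustrated in one line of update rule; the decay and learning rates below are arbitrary illustrative values, not taken from any of the cited papers.

```python
def weight_decay_step(weights, grads, lr=0.1, decay=0.01):
    """One gradient update with weight decay: every weight is first
    shrunk by a small multiplicative factor, so parameters that stop
    receiving significant gradient updates drift toward zero."""
    return [(1.0 - decay) * w - lr * g for w, g in zip(weights, grads)]
```

Repeated steps with a near-zero gradient shrink a weight geometrically, which is what lets the surviving weights stand out as relevant.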
There are also connectionist approaches that specialize in learning quickly in the presence of irrelevant inputs, without actually removing them. The WINNOW algorithm (Littlestone, 1988) for Boolean functions and the exponentiated gradient algorithm (Kivinen and Warmuth, 1997) for real-valued functions are capable of learning linearly separable functions efficiently in the presence of many irrelevant variables. Exponentiated gradient algorithms, of which WINNOW is a special case, are similar to gradient descent algorithms, except that the updates are multiplicative rather than additive.
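A minimal sketch of WINNOW's multiplicative update follows. The threshold n and promotion factor alpha = 2 follow the standard formulation; the simple epoch loop and example encoding are illustrative choices.

```python
def winnow_train(examples, n, alpha=2.0, epochs=10):
    """WINNOW: predict 1 when the weighted sum of active Boolean
    inputs reaches the threshold n; on a mistake, multiply (promote)
    or divide (demote) the weights of the active inputs by alpha."""
    w = [1.0] * n
    for _ in range(epochs):
        for x, y in examples:  # x: 0/1 vector, y: 0/1 label
            pred = 1 if sum(wi for wi, xi in zip(w, x) if xi) >= n else 0
            if pred != y:
                factor = alpha if y == 1 else 1.0 / alpha
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w
```

Because updates are multiplicative, weights on irrelevant inputs are driven down geometrically, which is the source of the mistake bound discussed next.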
The result is a mistake bound that is linear in the number of relevant inputs, but only logarithmic in the number of irrelevant inputs. Kivinen and Warmuth also observed that the number of examples required to learn an accurate hypothesis also appears to obey
these bounds. In other words, the number of training examples required by exponentiated gradient algorithms grows only logarithmically in the number of irrelevant inputs.
Exponentiated gradient algorithms may be applied to the problem of separating the set of relevant variables from irrelevant variables by running them on the available data and examining the resulting weights. Although exponentiated gradient algorithms produce a minimum error fit of the data in non-separable problems, there is no guarantee that such a fit will rely on the variables relevant to a non-linear fit.
Many algorithms that are directly applicable in non-linear situations experience a performance decline in the presence of irrelevant input variables. Even support vector machines, which are often touted as impervious to irrelevant variables, have been shown to improve performance with feature selection (Weston et al., 2000). A more general approach to recognizing relevant variables is needed.
3. Setting

Our algorithm for randomized variable elimination (RVE) requires a set (or sequence) of N-dimensional vectors x_i with labels y_i. The learning algorithm L is asked to produce a hypothesis h based only on the inputs x_ij that have not been marked as irrelevant (alternatively, a preprocessor could remove variables marked irrelevant). We assume that the hypotheses bear some relation to the data and input values. A degenerate learner (such as one that produces the same hypothesis regardless of data or input variables) will in practice cause the selection algorithm ultimately to select zero variables. This is true of most wrapper methods. For the purposes of this article, we use generalization accuracy as the performance criterion, but this is not a requirement of the algorithm.
We make the assumption that the number r of relevant variables is at least two to avoid degenerate cases in our analysis. The number of relevant variables should be small compared to the total number of variables N. This condition is not critical to the functionality of the RVE algorithm; however, the benefit of using RVE increases as the ratio of N to r increases. Importantly, we assume that the number of relevant variables is known in advance, although which variables are relevant remains hidden. Knowledge of r is a very strong assumption in practice, as such information is not typically available. We remove this assumption in Section 6, and present an algorithm for estimating r while removing variables.
4. The Cost Function

Randomized variable elimination is a wrapper method motivated by the idea that, in the presence of many irrelevant variables, the probability of successfully selecting several irrelevant variables simultaneously at random from the set of all variables is high. The algorithm computes the cost of attempting to remove k input variables out of n remaining variables given that r are relevant. A sequence of values for k is then found by minimizing the aggregate cost of removing all N − r irrelevant variables. Note that n represents the number of remaining variables, while N denotes the total number of variables in the original problem.
The first step in applying the RVE algorithm is to define the cost metric for the given learning algorithm. The cost function can be based on a variety of metrics, depending on which learning algorithm is used and the constraints of the application. Ideally, a metric
would indicate the amount of computational effort required for the learning algorithm to produce a hypothesis.
For example, an appropriate metric for the perceptron algorithm (Rosenblatt, 1958) might relate to the number of weight updates that must be performed, while the number of calls to the data purity criterion (e.g. information gain (Quinlan, 1986)) may be a good metric for decision tree induction algorithms. Sample complexity represents a metric that can be applied to almost any algorithm, allowing the cost function to compute the number of instances the learner must see in order to remove the irrelevant variables from the problem. We do not assume a specific metric for the definition and analysis of the cost function.

4.1 Definition
The first step of defining the cost function is to consider the probability

    p+(n, r, k) = Π_{i=0}^{k−1} (n − r − i) / (n − i)

of successfully selecting k irrelevant variables at random and without replacement, given that there are n remaining and r relevant variables. Next we use this probability to compute the expected number of consecutive failures before a success at selecting k irrelevant variables from n remaining given that r are relevant.

    E(n, r, k) = (1 − p+(n, r, k)) / p+(n, r, k)

yields the expected number of consecutive trials in which at least one of the r relevant variables will be randomly selected along with irrelevant variables prior to success.
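The two quantities above can be computed exactly with rational arithmetic. This is a direct transcription of the formulas for illustration, not part of the original presentation:

```python
from fractions import Fraction

def p_success(n, r, k):
    """Probability of drawing k irrelevant variables in a row, without
    replacement, from n variables of which r are relevant."""
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(n - r - i, n - i)
    return p

def expected_failures(n, r, k):
    """Expected number of consecutive failed selections before the
    first success (geometric distribution with success probability p+)."""
    p = p_success(n, r, k)
    return (1 - p) / p
```

For example, with n = 10, r = 2, and k = 3 the success probability is 7/15, so slightly more than one failed selection is expected before a success.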
We now discuss the cost of selecting and removing k variables, given n and r. Let M(L, n) represent an upper bound on the cost of running algorithm L based on n inputs. In the case of a perceptron, M(L, n) could represent an estimated upper bound on the number of updates performed by an n-input perceptron. In some instances, such as a backpropagation neural network (Rumelhart and McClelland, 1986), providing such a bound may be troublesome. In general, the order of the worst case computational cost of the learner with respect to the number of inputs is all that is needed. The bounding function should account for any assumptions about the nature of the learning problem. For example, if learning Boolean functions requires less computational effort than learning real-valued functions, then M(L, n) should include this difference. The general cost function described below therefore need not make any additional assumptions about the data.
In order to simplify the notation somewhat, the following discussion assumes a fixed algorithm for L. The expected cost of successfully removing k variables from n remaining, given that r are relevant, is given by

    I(n, r, k) = E(n, r, k) · M(L, n − k) + M(L, n − k)
               = M(L, n − k) · (E(n, r, k) + 1)

for 1 ≤ k ≤ n − r. The first term in the equation denotes the expected cost of failures (i.e. unsuccessful selections of k variables) while the second denotes the cost of the one success.
Given this expected cost of removing k variables, we can now define recursively the expected cost of removing all n − r irrelevant variables. The goal is to minimize locally the expected cost of removing k inputs with respect to the expected remaining cost, resulting in a global minimum expected cost for removing all n − r irrelevant variables. The use of a greedy minimization step relies upon the assumption that M(L, n) is monotonic in n. This is reasonable in the context of metrics such as number of updates, number of data purity tests, and sample complexity. The cost (with respect to learning algorithm L) of removing n − r irrelevant variables is represented by

    Isum(n, r) = min_k ( I(n, r, k) + Isum(n − k, r) ) .

The first part of the minimization term represents the cost of removing the first k variables, while the second part represents the cost of removing the remaining n − r − k irrelevant variables. Note that we define Isum(r, r) = 0.
The optimal value kopt(n, r) for k given n and r can be determined in a manner similar to computing the cost of removing all n − r irrelevant inputs. The value of k is computed as

    kopt(n, r) = argmin_k ( I(n, r, k) + Isum(n − k, r) ) .

4.2 Analysis
The primary benefit of this approach to variable elimination is that the combined cost (in terms of the metric M(L, n)) of learning the target function and removing the irrelevant input variables is within a constant factor of the cost of simply learning the target function based on all N inputs. This result assumes that the function M(L, n) is at least a polynomial of degree j > 0. In cases where M(L, n) is sub-polynomial, running the RVE algorithm increases the cost of removing the irrelevant inputs by a factor of log(n) over the cost of learning alone, as shown below.
4.2.1 Removing Multiple Variables

We now show that the above average-case bounds on the performance of the RVE algorithm hold. The worst case is the unlikely condition in which the algorithm always selects a relevant variable. We assume integer division here for simplicity. First let k = n/r, which allows us to remove the minimization term from the equation for Isum(n, r) and reduces the number of variables. This value of k is not necessarily the value selected by the above equations. However, the cost function is computed via dynamic programming, and the function M(L, n) is assumed monotonic. Any differences between our chosen value of k and the actual value computed by the equations can only serve to decrease further the cost of the algorithm. Note also that, because k depends on the number of current variables n, k changes at each iteration of the algorithm.
The probability of success p+(n, r, n/r) is minimized when n = r + 1, since there is only one possible successful selection and r possible unsuccessful selections. This in turn maximizes the expected number of failures E(n, r, n/r) = r. The formula for I(n, r, k) is now rewritten as

    I(n, r, n/r) ≤ (r + 1) · M(L, n − n/r) ,

where both M(L, n − k) terms have been combined.
The expected cost of removing all n − r irrelevant inputs may now be rewritten as a summation

    Isum(n, r) ≤ Σ_{i=0}^{r lg(n)} (r + 1) · M(L, n · ((r − 1)/r)^{i+1}) .

The second argument to the learning algorithm's cost metric M denotes the number of variables used at step i of the RVE algorithm. Notice that this number decreases geometrically toward r (recall that n = r is the terminating condition for the algorithm). The logarithmic factor of the upper bound on the summation,

    (lg(n) − lg(r)) / lg(1 + 1/(r − 1)) ≤ r lg(n) ,

follows directly from the geometric decrease in the number of variables used at each step of the algorithm. The linear factor r follows from the relationship between k and r. In general, as r increases, k decreases. Notice that as r approaches N, RVE and our cost function degrade into testing and removing variables individually.
Concluding the analysis, we observe that for functions M(L, n) that are at least polynomial in n with degree j > 0, the cost incurred by the first pass of RVE (i = 0) will dominate the remainder of the terms. The average-case cost of running RVE in these cases is therefore bounded by Isum(N, r) ∈ O(r · M(L, N)). An equivalent view is that the sum of a geometrically decreasing series converges to a constant. Thus, under the stated assumption that r is small compared to (and independent of) N, RVE requires only a constant factor more computation than the learner alone.
When M(L, n) is sub-linear in n (e.g. logarithmic), each pass of the algorithm contributes significantly to the total expected cost, resulting in an average-case bound of O(r² log(N) · M(L, N)). Note that we use average-case analysis here because in the worst case the algorithm can randomly select relevant variables indefinitely. In practice, however, long streaks of bad selections are rare.
4.2.2 Removing Variables Individually

Consider now the cost of removing the N − r irrelevant variables one at a time (k = 1). Once again the probability of success is minimized and the expected number of failures is maximized at n = r + 1. The total cost of such an approach is given by

    Isum(n, r) = Σ_{i=1}^{n−r} (r + 1) · M(L, n − i).

Unlike the multiple variable removal case, the number of variables available to the learner at each step decreases only arithmetically, resulting in a linear number of steps in n. This is an important deviation from the multiple selection case, which requires only a logarithmic number of steps. The difference between the two methods becomes substantial when N is large.
Concluding, the bound on the average-case cost of RVE is Isum(N, r) ≤ O(NrM(L, N)) when k = 1. This is true regardless of whether the variables are selected randomly or deterministically at each step.
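The gap between the two schedules is easy to see numerically. The sketch below is our own illustration, not code from the paper: it counts the steps taken when n shrinks geometrically toward r by a factor of 1 + 1/(r − 1) per successful removal, versus the n − r steps of one-at-a-time removal.

```python
import math

def steps_multi(n, r):
    # Steps when n decreases geometrically toward r by a factor of
    # 1 + 1/(r - 1) per step: (lg n - lg r) / lg(1 + 1/(r - 1))
    return math.ceil((math.log2(n) - math.log2(r)) /
                     math.log2(1 + 1 / (r - 1)))

def steps_single(n, r):
    # Removing one variable at a time takes n - r steps
    return n - r

for n in (100, 1000, 10000):
    print(n, steps_multi(n, r=10), steps_single(n, r=10))
```

For r = 10 the geometric schedule needs only a few dozen steps even at n = 10000, comfortably below the r lg(n) bound, while individual removal needs nearly n steps.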
In principle, a comparison should be made between the upper bound of the algorithm that removes multiple variables per step and the lower bound of the algorithm that removes a single variable per step in order to show the differences clearly. However, generating

Given: L, N, r

Isum[r + 1..N] ← 0
kopt[r + 1..N] ← 0
for i ← r + 1 to N do
    bestCost ← ∞
    for k ← 1 to i − r do
        temp ← I(i, r, k) + Isum[i − k]
        if temp < bestCost then
            bestCost ← temp
            bestK ← k
    Isum[i] ← bestCost
    kopt[i] ← bestK

Table 1: Algorithm for computing k and cost values.


a sufficiently tight lower bound requires making very strong assumptions on the form of M(L, n). Instead, note that the two upper bounds are comparable with respect to M(L, n) and differ only by the leading factor N.

4.3 Computing the Cost and k-Sequence

The equations for Isum(n, r) and kopt(n, r) suggest a simple O(N²) dynamic programming solution for computing both the cost and optimal k-sequence for a problem of N variables. Table 1 shows an algorithm for computing a table of cost and k values for each i with r + 1 ≤ i ≤ N. The algorithm fills in the tables of values by starting with small n, and bootstrapping to find values for increasingly large n. The function I(n, r, k) in Table 1 is computed as described above.
The O(N²) cost of computing the sequence of k values is of some concern. When N is large and the learning algorithm requires time only linear in N, the cost of computing the optimal k-sequence could exceed the cost of removing the irrelevant variables. In practice the cost of computing values for k is negligible for problems up to N = 1000. For larger problems, one solution is simply to set k = n − r as in Section 4.2.1. The analysis shows that this produces good performance and requires no computational overhead.
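The dynamic program of Table 1 can be sketched in a few lines of Python. Since the exact definition of I(n, r, k) appears earlier in the paper, the version below substitutes an illustrative stand-in: the cost M(L, n − k) of one training run divided by the probability that all k randomly chosen variables are irrelevant. The structure of the bottom-up computation is the point here, not the particular cost formula.

```python
from math import comb, inf

def make_tables(M, N, r):
    """Bottom-up computation of Isum and kopt for i = r+1, ..., N.
    M(n) is the learner's cost on n inputs."""
    def I(n, r, k):
        # Illustrative stand-in for the paper's I(n, r, k): expected
        # cost of one successful removal of k variables, where the
        # success probability is C(n-r, k) / C(n, k).
        p = comb(n - r, k) / comb(n, k)
        return M(n - k) / p

    Isum = {r: 0.0}
    kopt = {}
    for i in range(r + 1, N + 1):
        best_cost, best_k = inf, 1
        for k in range(1, i - r + 1):
            temp = I(i, r, k) + Isum[i - k]
            if temp < best_cost:
                best_cost, best_k = temp, k
        Isum[i], kopt[i] = best_cost, best_k
    return Isum, kopt

# Example with a linear, perceptron-style cost M(L, n) = 1897000 (n + 1)
Isum, kopt = make_tables(lambda n: 1897000 * (n + 1), N=100, r=10)
```

The resulting kopt table then drives the removal schedule of the RVE loop; at n = r + 1 the only legal choice is k = 1.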
5. The Randomized Variable Elimination Algorithm

Randomized variable elimination conducts a backward search through the space of variable subsets, eliminating one or more variables per step. Randomization allows for selection of irrelevant variables with high probability, while selecting multiple variables allows the algorithm to move through the space without incurring the cost of evaluating the intervening points in the space. RVE conducts its search along a very narrow trajectory. The space of variable subsets is sampled sparsely, rather than broadly and uniformly. This structured

yet random search allows RVE to reduce substantially the total cost of selecting relevant variables.
A backward approach serves two purposes for this algorithm. First, backward elimination eases the problem of recognizing irrelevant or redundant variables. As long as a core set of relevant variables remains intact, removing other variables should not harm the performance of a learning algorithm. Indeed, the learner's performance may increase as irrelevant features are removed from consideration. In contrast, variables whose relevance depends on the presence of other variables may have no noticeable effect when selected in a forward manner. Thus, mistakes should be recognized immediately via backward elimination, while good selections may go unrecognized by a forward selection algorithm.
The second purpose of backward elimination is to ease the process of selecting variables. If most variables in a problem are irrelevant, then a random selection of variables is naturally likely to uncover them. Conversely, a random selection is unlikely to turn up relevant variables in a forward search. Thus, the forward search must work harder to find each relevant variable than backward search does for irrelevant variables.

5.1 Algorithm

The algorithm begins by computing the values of kopt(i, r) for all r + 1 ≤ i ≤ n. Next it generates an initial hypothesis based on all n input variables. Then, at each step, the algorithm selects kopt(n, r) input variables at random for removal. The learning algorithm is trained on the remaining n − k inputs, and a hypothesis h′ is produced. If the error e(h′) of hypothesis h′ is less than the error e(h) of the previous hypothesis h (possibly within a given tolerance), then the selected k inputs are marked as irrelevant and are all simultaneously removed from future consideration. Kohavi and John (1997) provide an in-depth discussion on evaluating and comparing hypotheses based on limited data sets. If the learner was unsuccessful, meaning the new hypothesis had a larger error, then at least one of the selected variables was relevant. A new set of inputs is selected and the process repeats. The algorithm terminates when all n − r irrelevant inputs have been removed. Table 2 shows the RVE algorithm.
The structured search performed by RVE is easily distinguished from other randomized search methods. For example, genetic algorithms maintain a population of states in the search space and randomly mate the states to produce offspring with properties of both parents. The effect is an initially broad search that targets more specific areas as the search progresses. A wide variety of subsets are explored, but the cost of so much exploration can easily exceed the cost of a traditional greedy search. See Goldberg (1989) or Mitchell (1996) for detailed discussions on how genetic algorithms conduct search.
While GAs tend to drift through the search space based on the properties of individuals in the population, the LVF algorithm (Liu and Setiono, 1996) samples the space of variable subsets uniformly. LVF selects both the size of each subset and the member variables at random. Although such an approach is not susceptible to "bad decisions" or local minima, the probability of finding a best or even good variable subset decreases exponentially as the number of irrelevant variables increases. Unlike RVE, LVF is a filtering method, which relies on the inconsistency rate (number of equivalent instances divided by number of total instances) in the data with respect to the selected variables.

Given: L, n, r, tolerance

compute tables for Isum(i, r) and kopt(i, r)
h ← hypothesis produced by L on n inputs
while n > r do
    k ← kopt(n, r)
    select k variables at random and remove them
    h′ ← hypothesis produced by L on n − k inputs
    if e(h′) − e(h) ≤ tolerance then
        n ← n − k
        h ← h′
    else
        replace the selected k variables

Table 2: Randomized backward-elimination variable selection algorithm.
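In Python, the loop of Table 2 looks roughly as follows. The `train`, `evaluate`, and `kopt` arguments are placeholders for the learner, the error estimate, and the precomputed k-sequence; none of this is the authors' code.

```python
import random

def rve(train, evaluate, n, r, kopt, tolerance=0.0):
    """Sketch of randomized variable elimination (Table 2).
    train(vars) -> hypothesis; evaluate(h) -> error estimate;
    kopt[m] -> number of variables to remove when m remain."""
    variables = set(range(n))
    h = train(variables)
    while len(variables) > r:
        k = kopt[len(variables)]
        removed = set(random.sample(sorted(variables), k))
        h_new = train(variables - removed)
        if evaluate(h_new) - evaluate(h) <= tolerance:
            # success: mark all k selected inputs irrelevant at once
            variables -= removed
            h = h_new
        # on failure the selection is simply replaced and re-drawn
    return h, variables
```

A removal of any relevant variable raises the error and is rejected, so the relevant core survives every accepted step.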

5.2 A Simple Example

The preceding presentation of the RVE algorithm has remained strictly general, relying on no specific learning algorithm or cost metric. We consider now a specific example of how the randomized variable elimination algorithm may be applied to a linear threshold unit. The specific task examined here is to learn a Boolean function that is true when seven out of ten relevant variables are true, given a total of 100 input variables. In order to ensure that the hypotheses generated for each selection of variables have nearly minimal error, we use the thermal perceptron training algorithm (Frean, 1992). The thermal perceptron uses simulated annealing to settle weights regardless of data separability. The pocket algorithm (Gallant, 1990) is also applicable, but we found this to be slower and prone to more testing errors.
Twenty problems were generated randomly with N = 100 input variables, of which 90 are irrelevant and r = 10 are relevant. Each of the twenty problems used a different set of ten relevant variables (selected at random) and different data sets. Two data sets, each with 1000 instances, were generated independently for each problem. One data set was used for training while the other was used to validate the error of the hypotheses generated during each round of selections. The values of the 100 input variables were all generated independently. The mean number of unique instances with respect to the ten relevant variables was 466.
The first step in applying the RVE algorithm is to define the cost metric and the function M(L, n) for learning on n inputs. For the perceptron, we choose the number of weight updates as the metric. The thermal perceptron anneals a temperature T that governs the magnitude of the weight updates. Here we used T0 = 2 and decayed the temperature at a rate of 0.999 per training epoch until T < 0.3 (we observed no change in the hypotheses produced by the algorithm for T < 0.3). Given the temperature and decay rate, exactly 1897 training epochs are performed each time a thermal perceptron is trained. With

Figure 1: Plot of the expected cost of running RVE (Isum(N, r = 10)) along with the cost of removing inputs individually, and the estimated number of updates M(L, N).

1000 instances in the training data, the cost of running the learning algorithm is fixed at M(L, n) = 1897000(n + 1). Given the above cost formula for an n-input perceptron, a table of values for Isum(n, r) and kopt(n, r) can be constructed.
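The fixed epoch count and the resulting cost formula can be checked directly. The arithmetic below is our own verification of the schedule described above (T0 = 2, decay 0.999 per epoch, stop below 0.3, 1000 training instances, n + 1 weights including the bias).

```python
import math

# Number of epochs before T = 2 * 0.999^t first drops below 0.3
epochs = math.ceil(math.log(0.3 / 2.0) / math.log(0.999))
print(epochs)  # 1897

def M(n, instances=1000):
    # Weight updates per training run: one update per weight,
    # per instance, per epoch
    return epochs * instances * (n + 1)

print(M(100) == 1897000 * 101)  # True
```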
Figure 1 plots a comparison of the computed cost of the RVE algorithm, the cost of removing variables individually, and the estimated number of updates M(L, N) of an N-input perceptron. The calculated cost of the RVE algorithm maintains a linear growth rate with respect to N, while the cost of removing variables individually grows as N². This agrees with our analysis of the RVE and individual removal approaches. Relationships similar to that shown in Figure 1 arise for other values of r, although the constant factor that separates Isum(n, r) and M(L, n) increases with r.
After creating the table kopt(n, r), the selection and removal process begins. Since the seven-of-ten learning problem is linearly separable, the tolerance for comparing the new and current hypotheses was set to near zero. A small tolerance of 0.06 (equivalent to about 15 misclassifications) is necessary since the thermal perceptron does not guarantee a minimum error hypothesis.
We also allow the current hypothesis to bias the next by not randomizing the weights (of remaining variables) after each pass of RVE. Small-valued weights, suggesting potentially irrelevant variables, can easily transfer from one hypothesis to the next, although this is not guaranteed. Seeding the perceptron weights may increase the chance of finding a linear separator if one exists. If no separator exists, then seeding the weights should have minimal impact. In practice we found that the effect of seeding the weights was nullified by the thermal perceptron's use of annealing.
Figure 2: A comparison between the number of inputs on which the perceptrons are trained and the mean aggregate number of updates performed by the perceptrons.

5.3 Example Results

The RVE algorithm was run using the twenty problems described above. Hypotheses based on ten variables were produced using an average of 5.45 × 10⁹ weight updates, 81.1 calls to the learning algorithm, and 359.9 seconds on a 3.12 GHz Intel Xeon processor. A version of the RVE algorithm that removes variables individually (i.e. k was set permanently to 1) was also run, and produced hypotheses using 12.7 × 10⁹ weight updates, 138.7 calls to the learner, and 644.7 seconds. These weight update values agree with the estimate produced by the cost function. Both versions of the algorithm generated hypotheses that included irrelevant and excluded relevant variables for three of the test problems. All cases in which the final selection of variables was incorrect were preceded by an initial hypothesis (based on all 100 variables) with unusually high error (error greater than 0.18, or approximately 45 misclassified instances). Thus, poor selections occurred for runs in which the first hypothesis produced had high error due to annealing in the thermal perceptron.
Figure 2 plots the average number of inputs used for each variable set size (number of inputs) compared to the total number of weight updates. Each marked point on the plot denotes a size of the set of input variables given to the perceptron. The error bars indicate the standard deviation in number of updates required to reach that point. Every third point is plotted for the individual removal algorithm. Compare both the rate of drop in inputs and the number of hypotheses trained for the two RVE versions. This reflects the balance between the cost of training and unsuccessful variable selections. Removing variables individually in the presence of many irrelevant variables ignores the cost of training each hypothesis, resulting in a total cost that rises quickly early in the search process.

6. Choosing k When r Is Unknown

The assumption that the number of relevant variables r is known has played a critical role in the preceding discussion. In practice, this is a strong assumption that is not easily met. We would like an algorithm that removes irrelevant attributes efficiently without such knowledge. One approach would be simply to guess values for r and see how RVE fares. This is unsatisfying however, as a poor guess can destroy the efficiency of RVE. In general, guessing specific values for r is difficult, but placing a loose bound around r may be much easier. In some cases, the maximum value for r may be known to be much less than N, while in other cases, r can always be bounded by 1 and N.
Given some bound on the maximum rmax and minimum rmin values for r, a binary search for r can be conducted during RVE's search for relevant variables. This relies on the idea that RVE attempts to balance the cost of learning against the cost of selecting relevant variables for removal. At each step of RVE, a certain number of failures, E(n, r, k), are expected. Thus, if selecting variables for removal is too easy (i.e. we are selecting too few variables at each step), then the estimate for r is too high. Similarly, if selection fails an inordinate number of times, then the estimate for r is too low.
The choice of when to adjust r is important. The selection process must be allowed to fail a certain number of times for each success, but allowing too many failures will decrease the efficiency of the algorithm. We bound the number of failures by c1 E(n, r, k), where c1 > 1 is a constant. This allows for the failures prescribed by the cost function along with some amount of "bad luck" in the random variable selections. The number of consecutive successes is bounded similarly by c2 (r − E(n, r, k)), where c2 > 0 is a constant. Since E(n, r, k) is at most r, the value of this expression decreases as the expected number of failures increases. In practice c1 = 3 and c2 = 0.3 appear to work well.
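To get a feel for these bounds, suppose a selection succeeds only when all k randomly chosen variables are irrelevant, so the success probability is C(n − r, k)/C(n, k) and the expected number of failures before a success is its reciprocal minus one. This is an illustrative stand-in for the paper's E(n, r, k); the numbers below are not from the paper.

```python
from math import comb

def expected_failures(n, r, k):
    # Expected failed selections per success, assuming success means
    # all k chosen variables are irrelevant (illustrative E(n, r, k))
    p = comb(n - r, k) / comb(n, k)
    return 1.0 / p - 1.0

c1, c2 = 3.0, 0.3
n, r, k = 100, 10, 5
E = expected_failures(n, r, k)
print(round(E, 2))             # about 0.71 expected failures per success
print(round(c1 * E, 2))        # failure bound, about 2.14
print(round(c2 * (r - E), 2))  # success bound, about 2.79
```

With these illustrative values, roughly two consecutive failures or three consecutive successes would trigger an adjustment to the estimate of r.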

6.1 A General Purpose Algorithm

Randomized variable elimination including a binary search for r (RVErS, pronounced "reverse") begins by computing tables for kopt(n, r) for values of r between rmin and rmax. Next an initial hypothesis is generated and the variable selection loop begins. The algorithm chooses the number of variables to remove at each step based on the current value of r. Each time the bound on the maximum number of successful selections is exceeded, rmax reduces to r and a new value is calculated as r = (rmax + rmin)/2. Similarly, when the bound on consecutive failures is exceeded, rmin increases to r and r is recalculated. The algorithm also checks to ensure that the current number of variables never falls below rmin. If this occurs, r, rmin and rmax are all set to the current number of variables. RVErS terminates when rmin and rmax converge and c1 E(n, r, k) consecutive variable selections fail. Table 3 shows the RVErS algorithm.
While RVErS can produce good performance without finding the exact value of r, how well the estimated value must approximate the actual value is unclear. An important factor in determining the complexity of RVErS is how quickly the algorithm reaches a good estimate for r. In the best case, the search for r will settle on a good approximation of the actual number of relevant variables immediately, and the RVE complexity bound will apply. In the worst case, the search for r will proceed slowly over values of r that are too high, causing RVErS to behave like the individual removal algorithm.

Given: L, c1, c2, n, rmax, rmin, tolerance

compute tables Isum(i, r) and kopt(i, r) for rmin ≤ r ≤ rmax
r ← (rmax + rmin)/2
success, fail ← 0
h ← hypothesis produced by L on n inputs
repeat
    k ← kopt(n, r)
    select k variables at random and remove them
    h′ ← hypothesis produced by L on n − k inputs
    if e(h′) − e(h) ≤ tolerance then
        n ← n − k
        h ← h′
        success ← success + 1
        fail ← 0
    else
        replace the selected k variables
        fail ← fail + 1
        success ← 0
    if n ≤ rmin then
        r, rmax, rmin ← n
    else if fail ≥ c1 E(n, r, k) then
        rmin ← r
        r ← (rmax + rmin)/2
        success, fail ← 0
    else if success ≥ c2 (r − E(n, r, k)) then
        rmax ← r
        r ← (rmax + rmin)/2
        success, fail ← 0
until rmin ≥ rmax and fail ≥ c1 E(n, r, k)

Table 3: Randomized variable elimination algorithm including a search for r.
With respect to the analysis presented in Section 4.2.1, note that the constants c1 and c2 do not impact the total cost of performing variable selection. However, a large number of adjustments to rmin and rmax do impact the total cost negatively.
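The binary-search bookkeeping of Table 3 reduces to a small update rule. The sketch below isolates just that rule; the names and the integer midpoint are our choices, not the paper's.

```python
def update_r_estimate(r, r_min, r_max, success, fail, E, c1=3.0, c2=0.3):
    """Adjust the estimate of r after a selection attempt.
    E is the expected number of failed selections at the current step."""
    if fail >= c1 * E:
        # too many consecutive failures: the estimate of r is too low
        r_min = r
        r, success, fail = (r_max + r_min) // 2, 0, 0
    elif success > 0 and success >= c2 * (r - E):
        # selections succeed too easily: the estimate of r is too high
        r_max = r
        r, success, fail = (r_max + r_min) // 2, 0, 0
    return r, r_min, r_max, success, fail
```

For example, with r = 50, bounds [1, 100], and E = 1.5, six consecutive failures raise rmin to 50 and move the estimate to 75; twenty consecutive successes instead lower rmax to 50 and move the estimate to 25.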

6.2 An Experimental Comparison of RVE and RVErS

The RVErS algorithm was applied to the seven-of-ten problems using the same conditions as the experiments with RVE. Table 4 shows the results of running RVErS based on five values of rmax and rmin = 2. The results show that for increasing values of rmax, the

Algorithm     Mean Updates  Mean Time (s)  Mean Calls  Mean Inputs
RVE (kopt)    5.5×10⁹       359.9          81.1        10.0
rmax = 20     6.5×10⁹       500.7          123.8       10.8
rmax = 40     8.0×10⁹       603.8          151.3       10.2
rmax = 60     9.3×10⁹       678.8          169.0       10.0
rmax = 80     10.0×10⁹      694.7          172.3       10.0
rmax = 100    11.7×10⁹      740.7          184.1       9.9
RVE (k = 1)   12.7×10⁹      644.7          138.7       10.0

Table 4: Results of RVE and RVErS for several values of rmax. Mean calls refers to the number of calls made to the learning algorithm. Mean inputs refers to the number of inputs used by the final hypothesis.
performance of RVErS degrades slowly with respect to cost. The difference between RVErS with rmax = 100 and RVE with k = 1 is significant at the 95% confidence level (p = 0.049), as is the difference between RVErS with rmax = 20 and RVE with k = kopt (p = 0.0005). However, this slow degradation does not hold in terms of run time or number of calls to the learning algorithm. Here, only versions of RVErS with rmax = 20 or 40 show an improvement over RVE with k = 1.
The RVErS algorithm termination criterion causes the sharp increase in the number of calls to the learning algorithm. Recall that as n approaches r the probability of a failed selection increases. This means that the number of allowable selection failures grows as the algorithm nears completion. Thus, the RVErS algorithm makes many calls to the learner using a small number of inputs n in an attempt to determine whether the search should be terminated. The search for r compounds the effect. If, at the end of the search, the irrelevant variables have been removed but rmin and rmax have not converged, then the algorithm must work through several failed sequences in order to terminate.
Figure 3 plots the number of variables selected compared to the average total number of weight updates for rmax = 20, 60 and 100. The error bars represent the standard deviation in the number of updates. Notice the jump in the number of updates required for the algorithm to reach completion (represented by number of inputs equal to ten) compared to the number of updates required to reach twenty remaining inputs. This pattern does not appear in the results of either version of the RVE algorithm shown in Figure 2. Traces of the RVErS algorithm support the conclusion that many calls to the learner are needed to reach termination even after the correct set of variables has been found.
The increase in run times follows directly from the increasing number of calls to the learner. The thermal perceptron algorithm carries a great deal of overhead not reflected by the number of updates. Since the algorithm executes for a fixed number of epochs, the run time of any call to the learner will contribute noticeably to the run time of RVErS, regardless of the number of selected variables. Contrast this behavior to that of a learner whose cost is based more firmly on the number of input variables, such as naive Bayes. Thus, even though RVErS always requires fewer weight updates than RVE with k = 1, the latter may still run faster.
Figure 3: A comparison between the number of inputs on which the thermal perceptrons are trained and the aggregate number of updates performed using the RVErS algorithm.
This result suggests that the termination criterion of the RVErS algorithm is flawed. The large number of calls to the learner at the end of the variable elimination process wastes a portion of the advantage generated earlier in the search. More importantly, the excess number of calls to the learner does not respect the very careful search trajectory computed by the cost function. Although our cost function for the learner M(L, n) does take the overhead of the thermal perceptron algorithm into account, there is no allowance for unnecessary calls to the learner. Future research with randomized variable elimination should therefore include a better termination criterion.
7. Experiments with RVErS

We now examine the general performance properties of randomized variable elimination via experiments with several data sets. The previous experiments with perceptrons on the seven-of-ten problem focused on performance with respect to the cost metric. The following experiments are concerned primarily with minimizing run time and the number of calls to the learner while maintaining or improving accuracy. All tests were run on a 3.12 GHz Intel Xeon processor.
Unlike the linearly-separable perceptron experiments, the problems used here do not necessarily have solutions with zero test error. The learning algorithms may produce hypotheses with more variance in accuracy, requiring a more sophisticated evaluation function. The utility of variable selection with respect to even the most sophisticated learning algorithms is well known; see for example Kohavi and John (1997) or Weston, Mukherjee,

Data Set         Variables  Classes  Train Size  Test Size  Values of rmax
internet-ad      1558       2        3279        CV         1558, 750, 100
mult. feature    240        10       2000        CV         240, 150, 50
DNA              180        3        2000        1186       180, 100, 50
LED              150        10       2000        CV         150, 75, 25
opt-digits       64         10       3823        1797       64, 40, 25
soybean          35         19       683         CV         35, 25, 15
sick-euthyroid   25         2        3164        CV         25, 18, 10
monks-2-local    17         2        169         432        17, 10, 5

Table 5: Summary of data sets.
Chapelle, Pontil, Poggio, and Vapnik (2000). The goal here is to show that our comparatively liberal elimination method sacrifices little in terms of accuracy and gains much in terms of speed.

7.1 Learning Algorithms

The RVErS algorithm was applied to two learning algorithms. The first is the C4.5 release 8 algorithm (Quinlan, 1993) for decision tree induction, with options to avoid pruning and early stopping. We avoid pruning and early stopping because these are forms of variable selection, and may obscure the performance of RVErS. The cost metric for C4.5 is based on the number of calls to the gain-ratio data purity criterion. The cost of inducing a tree is therefore roughly quadratic in the number of variables: one call per variable, per decision node, with at most a linear number of nodes in the tree. Recall that an exact metric is not needed; only the order with respect to the number of variables must be correct.
The second learning algorithm used is naive Bayes, implemented as described by Mitchell (1997). Here, the cost metric is based on the number of operations required to build the conditional probability table, and is therefore linear in the number of inputs. In practice, these tables need not be recomputed for each new selection of variables, as the irrelevant table entries can simply be ignored. However, we recompute the tables here to illustrate the general case in which the learning algorithm must start from scratch.

7.2 Data Sets

A total of eight data sets were selected. Table 5 summarizes the data sets, and documentation is generally available from the UCI repository (Blake and Merz, 1998), except for the DNA problem, which is from StatLog (King et al., 1995). The first five problems reflect a preference for data with an abundance of variables and a large number of instances in order to demonstrate the efficiency of RVErS. The last three problems are included to show how RVErS performs on smaller problems, and to allow comparison with other work in variable selection. Further tests on smaller data sets are possible, but not instructive, as randomized elimination is not intended for data sets with few variables.
Three of the data sets (DNA, opt-digits, and monks) include predetermined training and test sets. The remaining problems used ten-fold cross validation. The version of the

LED problem used here was generated using code available at the repository, and includes a corruption of 10% of the class labels. Following Kohavi and John, the monks-2 data used here includes a local (one of n) encoding for each of the original six variables, for a total of 17 Boolean variables. The original monks-2 problem contains no irrelevant variables, while the encoded version contains six irrelevant variables.

7.3 Methodology

For each data set and each of the two learning algorithms (C4.5 and naive Bayes), we ran four versions of the RVErS algorithm. Three versions of RVErS use different values of rmax in order to show how the choice of rmax affects performance. The fourth version is equivalent to RVE with k = 1, using a stopping criterion based on the number of consecutive failures (as in RVErS). This measures the performance of removing variables individually given that the number of relevant variables is completely unknown. For comparison, we also ran forward step-wise selection, backward step-wise elimination and a hybrid filtering algorithm. The filtering algorithm simply ranked the variables by gain-ratio, executed the learner using the first 1, 2, 3, ..., N variables, and selected the best.
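The hybrid filter can be sketched as follows; `rank_score` and `learn_error` stand in for the gain-ratio computation and the cross-validated learner, and are not the code used in the experiments.

```python
def filter_select(variables, rank_score, learn_error):
    """Rank variables by a filter score (e.g. gain ratio), then train on
    the top 1, 2, ..., N ranked variables and keep the best prefix."""
    ranked = sorted(variables, key=rank_score, reverse=True)
    best_error, best_prefix = float("inf"), []
    for m in range(1, len(ranked) + 1):
        error = learn_error(ranked[:m])
        if error < best_error:
            best_error, best_prefix = error, ranked[:m]
    return best_prefix, best_error
```

This costs exactly N training runs, but the ranking itself ignores the learner's behavior, which is why the method is a (hybrid) filter rather than a wrapper.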
The learning algorithms used here provide no performance guarantees, and may produce highly variable results depending on variable selections and available data. All seven selection algorithms therefore perform five-fold cross-validation using the training data to obtain an average hypothesis accuracy generated by the learner for each selection of variables. The methods proposed by Kohavi and John (1997) could be used to improve error estimates for cases in which the variance in hypothesis error rates is high. Their method should provide reliable estimates for adjusting the values of rmin and rmax regardless of learning algorithm.
Preliminary experiments indicated that the RVErS algorithm is more prone to becoming bogged down during the selection process than deterministic algorithms. We therefore set a small tolerance (0.002), as shown in Table 3, which allows the algorithm to keep only very good selections of variables while still preventing the selection process from stalling unnecessarily. We have not performed extensive tests to determine ideal tolerance values.
The final selections produced by the algorithms were evaluated in one of two ways. Domains for which no specific test set is provided were evaluated via ten-fold cross-validation. The remaining domains used the provided training and test sets. In the second case, we ran each of the four RVErS versions five times in order to smooth out any fluctuations due to the random nature of the algorithm.

7.4 Results

Tables 6-9 summarize the results of running the RVErS algorithm on the given data sets using naive Bayes and C4.5 as learning algorithms. In the tables, iters denotes the number of search iterations, evals denotes the number of subset evaluations performed, inputs denotes the size of the final set of selected variables, error rates include the standard deviation where applicable, and cost represents the total cost of the search with respect to the learner's cost metric. The first row in each block shows the performance of the learner prior to variable selection, while the remaining rows show the performance of the seven selection algorithms. Finally, "NA" indicates that the experiment was terminated due to excessive computational cost.

Data Set   Learning   Selection     Iters  Subset  Inputs  Percent    Time    Search
           Algorithm  Algorithm            Evals           Error      (sec)   Cost
internet   Bayes      none          -      -       1558    3.0±0.9    0.5     4.61×10⁶
                      rmax = 100    137    137     37.8    3.0±1.2    165     6.13×10⁸
                      rmax = 750    536    536     9.2     3.2±1.2    790     3.99×10⁹
                      rmax = 1558   845    845     17.5    2.9±0.8    1406    8.26×10⁹
                      k = 1         1658   1658    8.8     3.0±1.2    2685    1.46×10¹⁰
                      forward       20     30810   18.9    2.5±0.8    22417   5.32×10⁹
                      backward      NA     NA      NA      NA         NA      NA
                      filter        1558   1558    837     3.1±0.9    2614    1.44×10¹⁰
internet   C4.5       none          -      -       1558    3.0±0.8    48      2.04×10⁵
                      rmax = 100    340    340     33.9    3.3±1.0    5386    3.30×10⁷
                      rmax = 750    1233   1233    25.7    3.7±1.2    40656   2.61×10⁸
                      rmax = 1558   1489   1489    20.0    3.2±1.3    78508   5.02×10⁸
                      k = 1         1761   1761    20.6    3.3±1.0    91204   6.02×10⁸
                      forward       19     28647   17.5    3.2±1.2    18388   1.95×10⁷
                      backward      NA     NA      NA      NA         NA      NA
                      filter        1558   1558    640     3.1±1.0    77608   3.98×10⁸
mult-ftr   Bayes      none          -      -       240     34.1±4.5   0.1     4.51×10⁵
                      rmax = 50     53     53      18.8    18.3±2.0   13      3.09×10⁷
                      rmax = 150    84     84      19.4    17.5±4.7   27      7.28×10⁷
                      rmax = 240    112    112     19.9    17.5±2.2   41      1.13×10⁸
                      k = 1         341    341     17.2    15.7±3.0   99      2.57×10⁸
                      forward       20     4539    18.7    12.3±1.6   527     7.55×10⁸
                      backward      186    27323   55.6    13.9±1.7   12097   3.52×10¹⁰
                      filter        240    240     53.6    22.5±2.7   83      2.30×10⁸
mult-ftr   C4.5       none          -      -       240     22.0±4.0   0.6     3.74×10⁴
                      rmax = 50     306    306     22.3    22.1±2.0   241     1.13×10⁷
                      rmax = 150    459    459     21.3    20.2±2.7   427     2.12×10⁷
                      rmax = 240    474    474     22.0    22.1±3.5   519     2.66×10⁷
                      k = 1         460    460     22.9    20.5±2.5   523     2.71×10⁷
                      forward       26     5960    25.3    20.4±3.5   2004    5.06×10⁷
                      backward      151    24722   90.8    20.4±3.1   51018   2.90×10⁹
                      filter        240    240     140     21.2±2.7   354     1.93×10⁷

Table 6: Variable selection results using the naive Bayes and C4.5 learning algorithms.
The performance of RVErS on the five largest data sets is encouraging. In most cases RVErS was comparable to the performance of step-wise selection with respect to generalization, while requiring substantially less computation. This effect is most clear in the mult-ftr data set, where forward selection with the C4.5 learner required nearly six CPU days to run (for ten-fold cross-validation) while the slowest RVErS version required just six hours. An exception to this trend occurs with the internet-ad data using C4.5. Here, the huge cost of running C4.5 with most of the variables included overwhelms RVErS's ability to eliminate variables quickly. Only the most aggressive run of the algorithm, with rmax = 100, manages to bypass the problem.
The internet-ad via C4.5 experiment highlights a second point. Notice how the forward selection algorithm runs faster than all but one version of RVErS. In this case, the cost and

Data Set   Learning   Selection     Iters  Subset  Inputs  Percent    Time    Search
           Algorithm  Algorithm            Evals           Error      (sec)   Cost
DNA        Bayes      none          -      -       180     6.7        0.08    3.63×10⁵
                      rmax = 50     359    359     24.2    4.7±0.7    52      1.52×10⁸
                      rmax = 100    495    495     30.0    4.9±0.8    75      2.39×10⁸
                      rmax = 180    519    519     25.6    5.0±0.5    84      2.89×10⁸
                      k = 1         469    469     23.6    4.7±0.3    76      2.56×10⁸
                      forward       19     3249    18.0    5.8        269     2.99×10⁸
                      backward      34     5413    148     6.5        1399    7.16×10⁹
                      filter        180    180     101     5.7        32      1.33×10⁸
DNA        C4.5       none          -      -       180     9.7        0.5     1.95×10⁴
                      rmax = 50     356    356     17.0    8.1±1.7    198     8.42×10⁶
                      rmax = 100    384    384     16.2    7.1±1.5    222     9.07×10⁶
                      rmax = 180    432    432     13.8    6.5±1.2    282     1.21×10⁷
                      k = 1         374    374     14.4    6.5±1.1    274     1.18×10⁷
                      forward       13     2262    12.0    5.9        418     2.33×10⁶
                      backward      110    13735   72.0    8.7        18186   8.23×10⁸
                      filter        180    180     17.0    7.6        163     7.10×10⁶
LED        Bayes      none          -      -       150     30.3±3.0   0.09    2.75×10⁵
                      rmax = 25     127    127     22.7    26.9±3.9   19      3.97×10⁷
                      rmax = 75     293    293     17.4    26.0±3.3   50      1.09×10⁸
                      rmax = 150    434    434     25.6    25.9±2.6   86      2.02×10⁸
                      k = 1         423    423     23.7    27.0±2.1   85      2.04×10⁸
                      forward       14     2006    13.0    26.6±2.9   141     1.54×10⁸
                      backward      14     1870    138.0   30.1±2.6   667     1.95×10⁹
                      filter        150    150     23.7    27.1±2.1   34      8.49×10⁷
LED        C4.5       none          -      -       150     43.9±4.5   0.5     5.48×10⁴
                      rmax = 25     85     85      51.1    42.0±3.0   89      1.01×10⁷
                      rmax = 75     468    468     25.8    42.5±4.5   363     3.70×10⁷
                      rmax = 150    541    541     25.2    40.8±5.7   440     4.48×10⁷
                      k = 1         510    510     32.4    42.5±2.7   439     4.63×10⁷
                      forward       9      1286    7.8     27.0±3.2   196     9.52×10⁵
                      backward      61     7218    90.9    43.5±3.5   11481   1.33×10⁹
                      filter        150    150     7.1     27.3±3.5   156     1.69×10⁷

Table 7: Variable selection results using the naive Bayes and C4.5 learning algorithms.
time of running C4.5 many times on a small number of variables is less than that of running C4.5 a few times on many variables. However, note that a slight change in the number of iterations needed by the forward algorithm would change the time and cost of the search dramatically. This is not the case for RVErS, since each iteration involves only a single evaluation instead of O(N) evaluations.
The number of subset evaluations made by RVErS is also important. Notice the growth in the number of evaluations with respect to the total (initial) number of inputs. For aggressive versions of RVErS, growth is very slow, while more conservative versions, such as k = 1, grow approximately linearly. This suggests that the theoretical results discussed for RVE remain valid for RVErS. Additional tests using data with many hundreds or thousands of
[Table 8 data could not be recovered from this extraction; it reported iterations, subset evaluations, inputs retained, percent error, time (sec), and search cost for the opt-digits and soybean data sets under the naive Bayes and C4.5 learners.]
Table 8: Variable selection results using the naive Bayes and C4.5 learning algorithms.
variables would be instructive, but may not be feasible with respect to the deterministic search algorithms.
RVErS does not achieve the same economy of subset evaluations on the three smaller problems as on the larger problems. This is not surprising, since the ratio of relevant variables to total variables is much smaller, requiring RVErS to proceed more cautiously. In these cases, the value of rmax has only a minor effect on performance, as RVErS is unable to remove more than two or three variables in any given step.
One problem evidenced by both large and small data sets is that there appears to be no clear choice of a best value for rmax. Conservative versions of RVErS tend to produce lower error rates, but there are exceptions. In some cases, rmax has very little effect on error. However, in most cases, small values of rmax have a distinct positive effect on run time.
[Table 9 data could not be recovered from this extraction; it reported iterations, subset evaluations, inputs retained, percent error, time (sec), and search cost for the euthyroid and monks-2 data sets under the naive Bayes and C4.5 learners.]
Table 9: Variable selection results using the naive Bayes and C4.5 learning algorithms.
The results suggest two other somewhat surprising conclusions. One is that backward elimination does not appear to have the commonly assumed positive effect on generalization. Step-wise forward selection tends to outperform step-wise backward elimination, although randomization often reduces this effect. The second conclusion is that the hybrid filter algorithm performs well in some cases, but worse than RVErS and step-wise selection in most cases. Notice also that for problems with many variables, RVErS runs as fast or faster than the filter. Additional experiments along these lines would be instructive.
Overfitting is sometimes a problem with greedy variable selection algorithms. Figures 4 and 5 show both the test and inner (training) cross-validation error rates for the selection algorithms on naive Bayes and C4.5 respectively. Solid lines indicate test error, while dashed lines indicate the inner cross-validation error. Notice that the test error is not always
[Figure 4 plot data could not be recovered from this extraction. The six panels (RVErS rmax=50, RVErS rmax=100, RVErS rmax=180, RVErS k=1, Forward, Backward) each plot error against iterations, with separate test-error and inner cross-validation (CV) error curves.]
Figure 4: Naive Bayes overfitting plots for DNA data.


[Figure 5 plot data could not be recovered from this extraction. The six panels (RVErS rmax=50, RVErS rmax=100, RVErS rmax=180, RVErS k=1, Forward, Backward) each plot error against iterations, with separate test-error and inner cross-validation (CV) error curves.]
Figure 5: C4.5 overfitting plots for DNA data.


minimized with the final selections produced by RVErS. The graphs show that RVErS does tend to overfit naive Bayes, but not C4.5 (or at least to a lesser extent). Trace data from the other data sets agree with this conclusion.
There are at least two possible explanations for overfitting by RVErS. One is that the tolerance level either causes the algorithm to continue eliminating variables when it should stop, or allows elimination of relevant variables. In either case, a better adjusted tolerance level should improve performance. The monks-2 data set provides an example. In this case, if the tolerance is set to zero, RVErS reliably finds variable subsets that produce low-error hypotheses with C4.5.
A second explanation is that the stopping criterion, which becomes more difficult to satisfy as the algorithm progresses, causes the elimination process to become overzealous. In this case the solution may be to augment the given stopping criterion with a hold-out data set (in addition to the validation set). Here the algorithm monitors performance in addition to counting consecutive failures, returning the best selection rather than simply the last. Combining this overfitting result with the above performance results suggests that RVErS is capable of performing quite well with respect to both generalization and speed.
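The hold-out remedy amounts to remembering the best subset seen so far rather than trusting the last one. A minimal sketch, under the assumption that a separate hold-out set is available through a hypothetical `holdout_score` function; the toy trace and scores are illustrative:

```python
def select_with_holdout(subset_trace, holdout_score):
    """Given the sequence of variable subsets accepted during elimination,
    return the one that scores best on a separate hold-out set, breaking
    ties in favor of later (smaller) subsets."""
    best_subset, best_score = None, float("-inf")
    for subset in subset_trace:
        score = holdout_score(subset)
        if score >= best_score:       # ">=" prefers later, smaller subsets
            best_subset, best_score = subset, score
    return best_subset

# Toy trace: scores rise, then fall as relevant variables get eliminated.
trace = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2}, {2}]
scores = {s: v for s, v in zip(map(frozenset, trace),
                               [0.80, 0.85, 0.90, 0.70])}
print(select_with_holdout(trace, lambda s: scores[frozenset(s)]))  # {1, 2}
```

This keeps the stopping criterion untouched; only the returned subset changes.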
8. Discussion

The speed of randomized variable elimination stems from two aspects of the algorithm. One is the use of large steps in moving through the search space of variable sets. As the number of irrelevant variables grows, and the probability of selecting a relevant variable at random shrinks, RVE attempts to take larger steps toward its goal of identifying all of the irrelevant variables. In the face of many irrelevant variables, this is a much easier task than attempting to identify the relevant variables.
The second source of speed in RVE is the approach of removing variables immediately, instead of finding the best variable (or set) to remove. This is much less conservative than the approach taken by step-wise algorithms, and accounts for much of the benefit of RVE. In practice, the full benefit of removing multiple variables simultaneously may only be beginning to materialize in the data sets used here. However, we expect that as domains scale up, multiple selections will become increasingly important. One example of this occurs in the STL algorithm (Utgoff and Stracuzzi, 2002), which learns many concepts over a period of time. There, the number of available input variables grows as more concepts are learned by the system.
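The elimination loop described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the `train_and_score` stand-in for the learner, the fixed step size k, and the failure-count stopping rule are all assumptions made for the example.

```python
import random

def train_and_score(variables, data):
    # Stand-in for running the learning algorithm on the given variable
    # subset and returning validation accuracy.  Here, accuracy is high
    # exactly when both "relevant" variables (0 and 1) are still present.
    return 1.0 if {0, 1} <= set(variables) else 0.5

def rve_sketch(n_vars, k, data=None, seed=0):
    """Randomized variable elimination with a fixed step size k:
    repeatedly remove k variables at random and keep the removal only
    if the retrained hypothesis is at least as good as the current one."""
    rng = random.Random(seed)
    remaining = set(range(n_vars))
    best = train_and_score(remaining, data)
    failures = 0
    while len(remaining) > k and failures < 20:   # simple stopping rule
        candidates = rng.sample(sorted(remaining), k)
        trial = remaining - set(candidates)
        score = train_and_score(trial, data)
        if score >= best:            # removal verified: commit immediately
            remaining, best = trial, score
            failures = 0
        else:                        # a relevant variable was hit: retry
            failures += 1
    return sorted(remaining)

print(rve_sketch(n_vars=100, k=5))   # relevant variables 0 and 1 survive
```

Note that a successful step commits all k removals at once; no attempt is made to find the best set to remove, which is exactly where the speed comes from.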
Consider briefly the cost of forward selection wrapper algorithms. Greedy step-wise search is bounded by O(rNM(L, r)) for forward selection and O(N(N - r)M(L, N)) for backward elimination, provided it does not backtrack or remove (or add) previously added (or removed) variables. The bound on the backward approach reflects both the larger number of steps required to remove the irrelevant variables and the larger number of variables used at each call to the learner. The cost of training each hypothesis is small in the forward greedy approach compared to RVE, since the number of inputs to any given hypothesis is much smaller (bounded roughly by r). However, the number of calls to the learning algorithm is polynomial in N. As the number of irrelevant variables increases, even a forward greedy approach to variable selection becomes quickly unmanageable.
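The bounds above can be turned into concrete call counts. The sketch below only counts subset evaluations (calls to the learner), ignoring the per-call training cost M(L, n); the values of N and r are illustrative, not taken from the experiments.

```python
def stepwise_calls(n_total, r, direction="forward"):
    # Number of subset evaluations (learner calls) for greedy step-wise
    # search, matching the O(rN) forward and O(N(N - r)) backward bounds.
    if direction == "forward":
        # Add one of the remaining variables at each of ~r steps:
        # N + (N-1) + ... + (N-r+1) evaluations.
        return sum(n_total - i for i in range(r))
    # Backward: drop one of the remaining variables at each of ~N-r steps.
    return sum(n_total - i for i in range(n_total - r))

N, r = 1000, 10   # illustrative sizes: many irrelevant variables
print(stepwise_calls(N, r, "forward"))    # 9955 calls, each on a small subset
print(stepwise_calls(N, r, "backward"))   # 500445 calls, each on a large subset
```

Even the cheaper forward direction makes roughly rN calls, which is what makes greedy wrappers expensive as N grows.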
The cost of a best-first search using compound operators (Kohavi and John, 1997) is somewhat harder to analyze. Their approach combines the two best operators (e.g. add variable or remove variable) and then checks whether the result is an improvement. If so, the resulting operator is combined with the next best operator and tested, continuing until there is no improvement. Theoretically this type of search could find a solution using approximately 2r forward subset evaluations or 2(N - r) backward subset evaluations. However, this would require the algorithm to make the correct choice at every step. The experimental results (Kohavi and John, 1997) suggest that in practice the algorithm requires many more subset evaluations than this minimum.
Compare the above bounds on forward and backward greedy search to that of RVE given a fixed k = 1, which is O(rNM(L, N)). Notice that the number of calls to the learning algorithm is the same for RVE with fixed k and a greedy forward search (the cost of learning is different, however). The empirical results support the conclusion that the two algorithms produce similar cost, but also show that RVE with k = 1 requires less CPU time. The source of this additional economy is unclear, although it may be related to various overhead costs associated with the learning algorithms. RVE requires many fewer total learner executions, thereby reducing overhead.
In practice, the k = 1 version of RVErS often makes fewer than rN calls to the learning algorithm. This follows from the very high probability of a successful selection of an irrelevant variable at each step. In cases when N is much larger than r, the algorithm with k = 1 makes roughly N calls to the learner, as shown in Tables 6 and 7. Additional economy may also be possible when k is fixed at one. Each variable should only need to be tested once, allowing RVErS to make exactly N calls to the learner. Further experiments are needed to confirm this intuition.
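The intuition that a k = 1 pass needs to test each variable only once can be made concrete. The sketch below is speculative in the same sense as the text (the paper leaves the idea to further experiments); the toy `score` function and random visit order are assumptions for the example.

```python
import random

def single_pass_elimination(variables, score, seed=0):
    """Visit each variable exactly once in random order; drop it if the
    hypothesis trained without it is at least as good.  Makes one call
    to the learner (`score`) per variable, plus one initial call."""
    rng = random.Random(seed)
    remaining = set(variables)
    best = score(remaining)
    order = list(variables)
    rng.shuffle(order)
    calls = 0
    for v in order:
        trial = remaining - {v}
        calls += 1
        if score(trial) >= best:     # variable v was irrelevant: drop it
            remaining = trial
    return remaining, calls

# Toy learner: accuracy is high only while variables 0 and 1 remain.
score = lambda s: 1.0 if {0, 1} <= s else 0.5
kept, calls = single_pass_elimination(range(20), score)
print(sorted(kept), calls)   # [0, 1] 20
```

Here N = 20 variables yield exactly 20 elimination tests, matching the intuition that RVErS with k fixed at one could make exactly N calls to the learner.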
Although the RVE algorithm using a fixed k = 1 is significantly more expensive than the optimal RVE or RVErS using a good guess for rmax, experiments and analysis show that this simple algorithm is generally faster than the deterministic forward or backward approaches, provided that there are enough irrelevant variables in the domain. As the ratio r/N decreases, and the probability of selecting an irrelevant variable at random increases, the benefit of a randomized approach improves. Thus, even when no information about the number of relevant variables is available, a randomized, backward approach to variable selection may be beneficial.
A disadvantage of randomized variable selection is that there is no clear way to recover from poor choices. Step-wise selection algorithms sometimes consider both adding and removing variables at each step, so that no variable is ever permanently selected or eliminated. A hybrid version of RVErS which considers adding a single variable each time a set of variables is eliminated is possible, but this would ultimately negate much of the algorithm's computational benefit.
Step-wise selection algorithms are sometimes parallelized in order to speed the selection process. This is due in large part to the very high cost of step-wise selection. RVE mitigates this problem to a point, but there is no obvious way to parallelize a randomized selection algorithm. Parallelization could be used to improve generalization performance by allowing the algorithm to evaluate several subsets simultaneously and then choose the best.
9. Future Work

There are at least three possible directions for future work with RVE. The first is an improved method for choosing k when r is unknown. We have presented an algorithm based on a binary search, but RVErS still wastes a great deal of time deciding when to terminate the search, and can quickly degenerate into a one-at-a-time removal strategy if bad decisions are made early in the search. Notice, however, that this worst-case performance is still better than step-wise backward elimination, and comparable to step-wise forward selection, both popular algorithms.
A second direction for future work involves further study of the effect of testing very few of the possible successors to the current search node. Testing all possible successors is the source of the high cost of most wrapper methods. If a sparse search, such as that used by RVE, does not sacrifice much quality in general, then other fast wrapper algorithms may be possible.
A third possible direction involves biasing the random selections at each step. If a set of k variables fails to maintain evaluation performance, then at least one of the k must have been relevant to the learning problem. Thus, variables included in a failed selection may be viewed as more likely to be relevant. This "relevance likelihood" can be tracked throughout the elimination process and used to bias selections at each step.
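The "relevance likelihood" idea can be sketched as a simple weighting scheme. Nothing here comes from the paper's experiments; the failure counters and the 1/(1 + failures) weighting rule are illustrative assumptions.

```python
import random

def biased_selection(remaining, fail_counts, k, rng):
    """Pick k variables to try eliminating, favoring variables that have
    appeared in fewer failed selections (i.e., that look less relevant).
    Samples with replacement for brevity; a real implementation would
    sample without replacement."""
    weights = [1.0 / (1 + fail_counts.get(v, 0)) for v in remaining]
    return rng.choices(remaining, weights=weights, k=k)

def record_failure(selection, fail_counts):
    # At least one variable in a failed selection must be relevant, so
    # raise the suspicion count of every member.
    for v in selection:
        fail_counts[v] = fail_counts.get(v, 0) + 1

rng = random.Random(1)
fail_counts = {}
record_failure([3, 7], fail_counts)     # variables 3 and 7 look relevant
picks = biased_selection(list(range(10)), fail_counts, k=3, rng=rng)
print(picks)   # variables 3 and 7 are now half as likely to be drawn
```

Over many iterations the counters would concentrate probability mass on variables that keep surviving failed selections, which is the biasing behavior the text describes.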
10. Conclusion

The randomized variable elimination algorithm uses a two-step process to remove irrelevant input variables. First, a sequence of values for k, the number of input variables to remove at each step, is computed such that the cost of removing all N - r irrelevant variables is minimized. The algorithm then removes the irrelevant variables by randomly selecting inputs for removal according to the computed schedule. Each step is verified by generating and testing a hypothesis to ensure that the new hypothesis is at least as good as the existing hypothesis. A randomized approach to variable elimination that simultaneously removes multiple inputs produces a factor-N speed-up over approaches that remove inputs individually, provided that the number r of relevant variables is known in advance.
When the number of relevant variables is not known, a search for r may be conducted in parallel with the search for irrelevant variables. Although this approach wastes some of the benefits generated by the theoretical algorithm, a reasonable upper bound on the number of relevant variables still produces good performance. When even this weaker condition cannot be satisfied, a randomized approach may still outperform the conventional deterministic wrapper approaches, provided that the number of relevant variables is small compared to the total number of variables. A randomized approach to variable selection is therefore applicable whenever the target domain is believed to have many irrelevant variables.
Finally, we conclude that an explicit search through the space of variable subsets is not necessary to achieve good performance from a wrapper algorithm. Randomized variable elimination provides competitive performance without incurring the high cost of expanding and evaluating all successors of a search node. As a result, randomized variable elimination scales well beyond current wrapper algorithms for variable selection.
Acknowledgments

The authors thank Bill Hesse for his advice concerning the analysis of RVE. This material is based upon work supported by the National Science Foundation under Grant No. 0097218. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References

D. W. Aha and R. L. Bankert. Feature selection for case-based classification of cloud types: An empirical comparison. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pages 106-112, Seattle, WA, 1994. AAAI Press.

H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.

C. L. Blake and C. J. Merz. UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, 1998.

C. Cardie. Using decision trees to improve case-based learning. In Machine Learning: Proceedings of the Tenth International Conference, Amherst, MA, 1993. Morgan Kaufmann.

R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.

K. J. Cherkauer and J. W. Shavlik. Growing simpler decision trees to facilitate knowledge discovery. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996.

W. W. Cooley and P. R. Lohnes. Multivariate data analysis. Wiley, New York, 1971.

P. A. Devijver and J. Kittler. Pattern recognition: A statistical approach. Prentice Hall International, 1982.

P. Domingos. Context-sensitive feature selection for lazy learners. Artificial Intelligence Review, 11:227-253, 1997.

G. H. Dunteman. Principal components analysis. Sage Publications, Inc., Newbury Park, CA, 1989.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Technical Report TR-220, Stanford University, Department of Statistics, 2003.

S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, 2:524-532, 1990.

W. Finnoff, F. Hergert, and H. G. Zimmermann. Improving model selection by nonconvergent methods. Neural Networks, 6:771-783, 1993.

M. Frean. A "thermal" perceptron learning rule. Neural Computation, 4(6):946-957, 1992.

J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881-889, 1974.

S. I. Gallant. Perceptron-based learning. IEEE Transactions on Neural Networks, 1(2):179-191, 1990.

D. Goldberg. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Reading, MA, 1989.

M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.

B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.

I. Inza, P. Larranaga, R. Etxeberria, and B. Sierra. Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 123(1-2):157-184, 2000.

G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121-129, New Brunswick, NJ, 1994. Morgan Kaufmann.

R. Kerber. ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 123-128, San Jose, CA, 1992. MIT Press.

R. King, C. Feng, and A. Sutherland. StatLog: Comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9(3):259-287, 1995.

K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, 1997.

R. Kohavi. Wrappers for performance enhancement and oblivious decision graphs. PhD thesis, Department of Computer Science, Stanford University, Stanford, CA, 1995.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.

D. Koller and M. Sahami. Toward optimal feature selection. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 284-292. Morgan Kaufmann, 1996.

M. Kubat, D. Flotzinger, and G. Pfurtscheller. Discovering patterns in EEG signals: Comparative study of a few methods. In Proceedings of the European Conference on Machine Learning, pages 367-371, 1993.

P. Langley and S. Sage. Oblivious decision trees and abstract cases. In Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, Seattle, WA, 1994. AAAI Press.

Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, pages 598-605, San Mateo, CA, 1990. Morgan Kaufmann.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1988.

H. Liu and R. Setiono. A probabilistic approach to feature selection: A filter solution. In Machine Learning: Proceedings of the Fourteenth International Conference. Morgan Kaufmann, 1996.

H. Liu and R. Setiono. Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4):642-645, 1997.

O. Maron and A. W. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann, 1994.

M. Mitchell. An introduction to genetic algorithms. MIT Press, Cambridge, MA, 1996.

T. M. Mitchell. Machine learning. McGraw-Hill, 1997.

J. Moody and J. Utans. Architecture selection for neural networks: Application to corporate bond rating prediction. In A. N. Refenes, editor, Neural Networks in the Capital Markets. John Wiley and Sons, 1995.

J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

J. R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958.

D. E. Rumelhart and J. L. McClelland. Parallel distributed processing. MIT Press, Cambridge, MA, 1986. 2 volumes.

L. Thurstone. Multivariate data analysis. Psychological Review, 38:406-427, 1931.

P. E. Utgoff and D. J. Stracuzzi. Many-layered learning. Neural Computation, 14(10):2497-2529, 2002.

H. Vafaie and K. De Jong. Genetic algorithms as a tool for restructuring feature space representations. In Proceedings of the International Conference on Tools with AI. IEEE Computer Society Press, 1995.

L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

P. Werbos. Backpropagation: Past and future. In IEEE International Conference on Neural Networks. IEEE Press, 1988.

J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, pages 668-674. MIT Press, 2000.
