JEFF K. ELSER
Montana State University
Department of Computer Science
Bozeman, MT
elser@cs.montana.edu
ABSTRACT
The standard formulation of support vector machines (SVMs)
does not allow for missing values for any of the attributes in an
example being learned or classified.
We examine
preprocessing methods to prepare data sets containing missing
values for use by an SVM. These methods can be separated
into two categories: ignoring the missing data (discarding
training examples with a missing attribute or discarding
attributes that have missing values), or using a process known
as imputation to generate missing values. A simple imputation
method is just to use the average value for the attribute, but
there are more robust techniques, including a k-nearest
neighbors (KNN) approach and the use of a separate SVM to
determine the likely value. We compare the performance of an
SVM on test datasets containing missing values using each of
the above techniques, and finally we propose a topic for further
research: a possible method of handling the missing values
directly within the SVM learning algorithm.
INTRODUCTION
using the attribute's mean value; imputing the missing values using KNN; and
imputing the missing values using a separate SVM.
DATA SETS
When we generated the training data sets, we also generated test data sets
with the same distribution for each attribute. After training an SVM using the
training data, we tested it against the test data set and measured the accuracy.
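As an illustration, synthetic train and test sets with identically distributed attributes might be generated as follows. The specific distribution and labeling rule here are assumptions for the sketch, not necessarily the ones used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_examples, n_attributes):
    """Draw every attribute from one fixed distribution and label each
    example with a simple linear rule, so that separately generated
    train and test sets share the same attribute distribution."""
    X = rng.normal(size=(n_examples, n_attributes))
    w = np.ones(n_attributes)           # hypothetical separating direction
    y = np.where(X @ w >= 0.0, 1, -1)   # +1/-1 classification labels
    return X, y

X_train, y_train = make_dataset(200, 5)
X_test, y_test = make_dataset(100, 5)   # same distribution, fresh draws
```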
INTRODUCING MISSING VALUES
In the generated data, a random attribute was picked, and then rows of that
attribute were randomly chosen to be missing. The number of columns and rows
affected was explicitly controlled by us, but the actual positions of the columns
and rows were random.
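The masking step can be sketched as follows; NaN stands in for a missing value, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def introduce_missing(X, n_missing, column=None):
    """Pick one attribute (column) at random, then mark n_missing
    randomly chosen rows of that column as missing (NaN). The counts
    are controlled explicitly; the positions are random."""
    X = X.astype(float).copy()
    if column is None:
        column = int(rng.integers(X.shape[1]))
    rows = rng.choice(X.shape[0], size=n_missing, replace=False)
    X[rows, column] = np.nan
    return X, column

X_missing, col = introduce_missing(np.zeros((10, 3)), n_missing=4)
```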
HANDLING MISSING VALUES
The first two methods by which we deal with missing values in the data set
are very simple: just ignore training examples that contain missing values, or
ignore attributes that have missing values in any of the training examples.
Note that the first method causes problems if any examples to be classified
after learning is complete contain missing values; they must either be ignored, or
the missing values must be generated using one of the following methods.
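Both deletion strategies reduce to dropping rows or columns that contain a missing value. A minimal sketch, assuming missing entries are encoded as NaN:

```python
import numpy as np

def discard_examples(X, y):
    """Keep only the rows (training examples) with no missing values."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

def discard_attributes(X):
    """Keep only the columns (attributes) with no missing values."""
    keep = ~np.isnan(X).any(axis=0)
    return X[:, keep]

X = np.array([[1.0, np.nan], [3.0, 4.0]])
y = np.array([1, -1])
```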
Mean/Mode Imputation
This technique is not significantly more complicated than the first two. If a
training example has a missing value, that value is replaced by the most
common value for that attribute. For continuous attributes, we use the mean
(average) of all the specified values for that attribute in the training data set.
(This does impose the requirement that two passes be made over the data: one to
determine the mean and another to update any missing values.) For discrete
attributes, we use the mode, or most commonly occurring value, for that attribute.
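The two-pass procedure can be sketched as follows; NaN and None encode missing values, and the helper names are ours.

```python
import numpy as np
from collections import Counter

def impute_mean(X):
    """First pass: compute per-attribute means over the observed values.
    Second pass: substitute the mean wherever a value is missing."""
    X = X.copy()
    means = np.nanmean(X, axis=0)   # pass 1: column means, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]     # pass 2: fill in the missing entries
    return X

def impute_mode(values):
    """For a discrete attribute: replace missing entries (None) with the
    most commonly occurring observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```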
K-Nearest Neighbors (KNN) Imputation
SVM Imputation
An SVM can be used to impute the missing values as well. For each
attribute in the training set that has missing values, we train an SVM using all of
the training examples that have no missing values; we ignore the original
classification value from the data set and instead use the value of the attribute
being imputed as the target value. We also ignore any other attributes that have any
missing values when generating this new training data.
If the attribute is a continuous value, we use SVM-Light in regression mode
to learn the data. If the attribute is discrete with only two values, a standard
SVM in classification mode can be used. For a discrete attribute with more than
two values, special handling is necessary: standard SVMs, including SVM-Light, do not perform multi-category classification. We use the standard technique of
one-against-the-others to handle multi-valued discrete attributes; if the attribute
has n values, we generate n separate training data sets; in data set i, the
classification value for each example indicates whether or not the example had
value i for the attribute.
After an SVM is trained on each data set, we then use that SVM to classify
or perform regression on the examples with missing values for the attribute.
Again, we ignore any other attributes that have missing values. If the attribute
being imputed is continuous, the SVM will use regression to generate the value.
If the attribute is discrete, we need to classify it with each of the n SVM
models and select the value corresponding to the SVM that classifies the
example as positive. If more than one SVM generates a positive classification,
we randomly select one value.
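A sketch of one-against-the-others imputation of a discrete attribute, using scikit-learn's SVC as a stand-in for SVM-Light; the function name, and the fallback of taking the highest decision score when no SVM votes positive, are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

def impute_discrete_ovr(X_complete, target_col, incomplete_rows):
    """One-against-the-others imputation of a discrete attribute with n
    values: train one binary SVM per value i (positive iff the example
    has value i), then assign each incomplete example the value whose
    SVM classifies it as positive, breaking ties at random."""
    features = np.delete(X_complete, target_col, axis=1)
    target = X_complete[:, target_col]
    values = np.unique(target)
    models = []
    for v in values:                        # data set i: value i vs. the rest
        y_i = np.where(target == v, 1, -1)
        models.append(SVC(kernel="linear").fit(features, y_i))
    imputed = []
    for x in incomplete_rows:               # rows missing the target attribute
        x_feat = np.delete(x, target_col).reshape(1, -1)
        scores = np.array([m.decision_function(x_feat)[0] for m in models])
        positive = np.flatnonzero(scores > 0)
        if len(positive) == 0:              # assumed fallback: best score wins
            imputed.append(values[np.argmax(scores)])
        else:                               # random choice among positives
            imputed.append(values[rng.choice(positive)])
    return np.array(imputed)

# Demo: the second column is a two-valued discrete attribute determined
# by the first, so the imputed values should follow that rule.
feat = np.arange(10.0)
X_complete = np.column_stack([feat, (feat >= 5).astype(float)])
X_incomplete = np.array([[2.0, np.nan], [8.0, np.nan]])
imputed = impute_discrete_ovr(X_complete, target_col=1, incomplete_rows=X_incomplete)
```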
RESULTS
The results of processing each of our data sets containing missing data
using each of the above methods are summarized below.
For the generated data, we averaged the accuracies for 10 varied data sets,
for each imputation technique. As would be expected, the control where the
actual data was left in had the best accuracy. The next best technique was using
information from the model to pick values. This involved replacing missing
values with the expected center of the model. The use of an SVM to impute the
values was next best. Effectively tying for fourth were three more realistic
techniques: discarding attributes with missing values, k-nearest neighbors, and
replacing values with the mean. Since it would be uncommon to have
information about the model to simply pick good values, these last three are
probably the dominant choices. Replacing missing values with 1 or -1 was, as
expected, inferior. The worst technique was discarding rows. This could be partly
because we chose our missing values by focusing on single columns. That
would cause the missing values to be distributed more evenly between the rows,
and thus cause a greater loss of data upon removing those rows.
Method               Average Accuracy (%)
-------------------  --------------------
Control                             93.19
Replace with 0                      88.92
Replace with 1                      80.02
Replace with -1                     80.57
Discard Examples                    68.50
Discard Attributes                  86.66
Mean                                86.76
KNN                                 86.75
SVM                                 87.15

Table 1: Average accuracy over 10 generated data sets for each technique.
TOPICS FOR FURTHER RESEARCH
drawing a line between the closest points on the two convex hulls, and drawing
a plane perpendicular to this line halfway between the convex hulls.
Consider adding a new data point which has a missing value for an
attribute. When this data point is mapped to the higher-dimensional space, if it
lies within the boundaries of its classification's convex hull, regardless of the
value which could be assigned to the missing attribute, or if it extends the
convex hull in a direction away from the convex hull enclosing the other
classification, the point can be ignored; it cannot be a support vector. If the data
point is mapped in such a way that it extends the convex hull in a direction
toward the other convex hull, it may be significant. If varying the value of the
missing attribute within some range of values moves the point along a line that
is parallel to the closest face of the other convex hull, then it is safe to assign any
of those values to the missing attribute.
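The inside-the-hull test in the argument above can be sketched directly in a low-dimensional input space, here using scipy's Delaunay triangulation; whether such a check remains tractable in the implicit high-dimensional feature space is exactly the open question.

```python
import numpy as np
from scipy.spatial import Delaunay

def inside_hull(points, query):
    """True if `query` lies inside (or on the boundary of) the convex
    hull of `points`; by the argument above, such a point cannot become
    a support vector for its class."""
    return Delaunay(points).find_simplex(query) >= 0

# Demo: the unit square's corners form the hull.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```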
The difficulty with this analysis is that the SVM does not actually map the
points to the higher-dimensional space; it evaluates a kernel function that is
chosen such that its result is proportional to the inner product in the higher-dimensional space. The values obtained from the kernel function are supplied to
a quadratic programming optimization algorithm to select the support vectors.
In order to handle examples with missing values within the SVM as it is
learning, it might be possible to somehow tag the values and generate a range of
results from the kernel evaluation, then modify the quadratic optimization
algorithm to determine whether the point should be a candidate for a support
vector.
For a detailed introduction to the workings of SVMs, see (Chen, 2003). For
more detailed analysis of SVMs and their applications, see (Cristianini, 2000)
and (Schölkopf, 2002).
CONCLUSIONS
ACKNOWLEDGEMENTS
The experiments conducted for this research used the SVM-Light package,
written by Thorsten Joachims (Joachims, 1999). For information about the
package, and to download, visit http://svmlight.joachims.org/.
REFERENCES
Chen, P.-H., C.-J. Lin, and B. Schölkopf, 2003, A Tutorial on Nu-Support Vector Machines,
http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf.
Cristianini, N. and J. Shawe-Taylor, 2000, An Introduction to Support Vector Machines, Cambridge
University Press.
Joachims, T., 1999, Making Large-Scale SVM Learning Practical, in B. Schölkopf, C. Burges,
and A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press,
Cambridge, MA. http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_99a.pdf.
Schölkopf, B. and A. J. Smola, 2002, Learning with Kernels, MIT Press, Cambridge, MA.