Sunteți pe pagina 1din 6

1

IMPUTATION OF MISSING DATA FOR INPUT TO SUPPORT


VECTOR MACHINES
BOB L. WALL
Montana State University
Department of Computer
Science
Bozeman, MT
bwall@cs.montana.edu

JEFF K. ELSER
Montana State University
Department of Computer
Science
Bozeman, MT
elser@cs.montana.edu

ABSTRACT
The standard formulation of support vector machines (SVMs)
does not allow for missing values for any of the attributes in an
example being learned or classified.
We examine
preprocessing methods to prepare data sets containing missing
values for use by an SVM. These methods can be separated
into two categories: ignoring the missing data (discarding
training examples with a missing attribute or discarding
attributes that have missing values), or using a process known
as imputation to generate missing values. A simple imputation
method is just to use the average value for the attribute, but
there are more robust techniques, including a k-nearest
neighbors (KNN) approach and the use of a separate SVM to
determine the likely value. We compare the performance of an
SVM on test datasets containing missing values using each of
the above techniques, and finally we propose a topic for further
research - a possible method of handling the missing values
directly within the SVM learning algorithm.
INTRODUCTION

The standard formulation of support vector machines (SVMs) does not


allow for missing values for any of the attributes in an example being learned or
classified. We examine methods by which data sets containing missing values
can be processed using an SVM. This is typically accomplished by one of two
means: ignoring the missing data (either by discarding examples with a missing
attribute value or discarding an attribute that has missing values), or using a
process generally referred to as imputation, by which a value is generated for the
attribute. These techniques are typically carried out on the data prior to its being
supplied to the learning algorithm. A simple imputation method is just to use the
average value for the attribute, but there are more robust techniques that are
likely to supply a value closer to the one that is missing, including a k-Nearest
neighbors (KNN) approach and the use of a separate SVM to determine the
likely value.
We compare the performance of an SVM on test datasets containing
missing values using the following techniques: ignoring examples with missing
values; ignoring attributes with missing values; imputing the missing values

2
using the attributes mean value; imputing the missing values using KNN; and
imputing the missing values using a separate SVM.
DATA SETS

In order to have control over the variables in the experiments, we wrote a


data set generator. The generator calculated the sum of the squares of all the
values in a row of data. If the sum was less than or equal to one, the example
belonged to the class. If the sum was greater than one, it did not belong to the
class. This creates a hyper-sphere shape which is not linearly separable. This is
necessary because the linearly classifiable data we started with was not causing
any error. The error caused by the harder data sets allows us to see differences
in performance better. The number of positive and negative training examples
and test cases were controlled explicitly. The number of attributes was also
explicitly stated for each set.
Two data sets had 20 examples, four had 50 examples, two had 100
examples, one had 1000, and one had 10,000. Each had an equal number of
positive and negative examples. The number of attributes varied between two
and fifteen.
MEASURING SVM PERFORMANCE

When we generated the training data sets, we also generated test data sets
with the same distribution for each attribute. After training an SVM using the
training data, we tested it against the test data set and measured the accuracy.
INTRODUCING MISSING VALUES

In the generated data, a random attribute was picked and then rows of that
attribute were randomly picked to be missing. The number of columns and rows
used were explicitly controlled by us, but the actual position of the columns and
rows was random.
HANDLING MISSING VALUES

Eliminating Missing Values

The first two methods by which we deal with missing values in the data set
are very simple: just ignore training examples that contain missing values, or
ignore attributes that have missing values in any of the training examples.
Note that the first method causes problems if any examples to be classified
after learning is complete contain missing values; they must either be ignored, or
the missing values must be generated using one of the following methods.

Replacing with Mean / Mode Value

This technique is not significantly more complicated than the first two. If a
training example has a missing value, that value is replaced by the most
common value for that attribute. For continuous attributes, we use the mean or
average of all the specified values for that attribute in the training data set. (This
does impose the requirement that two passes be made over the data, one to
determine the mean and another to update any missing values.) For discrete
attributes, we use the mode, or most commonly occuring, value for that attribute.
K-Nearest Neighbors (KNN) Imputation

For this implementation, if a training example contains one or more missing


values, we measure the distance between the example and all other examples
that contain no missing values. Our distance metric is a modified version of the
Manhattan distance the distance between two examples is the sum of the
distances between the corresponding attribute values in each example. For
discrete attributes, this distance is 0 if the values are the same, and 1 otherwise.
In order to combine distances for discrete and continuous attributes, we perform
a similar distance measurement for continuous attributes if the absolute
difference between the two values is less than half a standard deviation, the
distance is 0; otherwise, it is 1.
The K complete examples closest to the example with missing values are
used to choose a value. For a discrete attribute, the most frequently occurring
value is used. For a continuous attribute, the average of the values from the K
neighbors is used. For our experiments, we used K = 5.
SVM Imputation

An SVM can be used to impute the missing values as well. For each
attribute in the training set that has missing values, we train an SVM using all of
the training examples that have no missing values; we ignore the original
classification value from the data set and use the value of the attribute being
imputed is the target value. We also ignore any other attributes that have any
missing values when generating this new training data.
If the attribute is a continuous value, we use SVM-Light in regression mode
to learn the data. If the attribute is discrete with only two values, a standard
SVM in classification mode can be used. For a discrete attribute with more than
two values, special handling is necessary standard SVMs, including SVMLight, do not do multi-category classification. We use the standard technique of
one-against-the-others to handle multi-valued discrete attributes; if the attribute
has n values, we generate n separate training data sets; in data set i, the
classification value for each example indicates whether or not the example had
value i for the attribute.
After an SVM is trained on each data set, we then use that SVM to classify
or perform regression on the examples with missing values for the attribute.
Again, we ignore any other attributes that have missing values. If the attribute
being imputed is continuous, the SVM will use regression to generate the value.
If the attribute is continuous, we need to classify it with each of the n SVM
models and select the value corresponding to the SVM that classifies the

4
example as positive. If more than one SVM generates a positive classification,
we randomly select one value.
RESULTS

The results of processing each of our data sets containing missing data
using each of the above methods is summarized below.
For the generated data, we averaged the accuracies for 10 varied data sets,
for each imputation technique. As would be expected, the control where the
actual data was left in had the best accuracy. The next best technique was using
information from the model to pick values. This involved replacing missing
values with the expected center of the model. The use of an SVM to impute the
values was next best. Effectivly tieing for fourth were three more realistic
techniques: discard attributes with missing values, k-nearest neighbor, and
replacing values with the mean. Since it is would be uncommon to have
information about the model to simply pick good values, these last three are
probably the dominant choices. Replacing with a 1 or -1 were inferior as
expected. The worst technique was discarding rows. This could be partly
because we chose our missing values by focusing on single columns. That
would cause the missing values to be distributed more evenly between the rows,
and thus cause a greater loss of data upon removing those rows.
Method
Average Accuracy (%)
Control
93.19
Replace with 0
88.92
Replace with 1
80.02
Replace with -1
80.57
Discard Examples
68.5
Discard Attributes
86.66
Mean
86.76
KNN
86.75
SVM
87.15
Table 1
TOPICS FOR FURTHER RESEARCH

It is obviously of significant value in certain data sets to avoid throwing


away examples because one or more of their attribute values are missing. Other
attributes that do have values might be significant to learning the classification
criteria. However, if possible, care should be taken to avoid introducing error to
the classification by assigning an incorrect value for the missing attribute.
It may be possible to take advantage of the way in which the SVM
computes the hyperplane that separates categories to better handle missing
attribute values. Consider the theoretical basis for SVM learning positive and
negative examples are mapped to a higher-dimensional Euclidean space (i.e. a
space in which the inner product is defined) in which it is possible to separate
the positive and negative examples with a single hyperplane. The two groups of
examples are each enclosed inside a convex hull, and the hyperplane is built by

5
drawing a line between the closest points on the two convex hulls, and drawing
a plane perpendicular to this line halfway between the convex hulls.
Consider adding a new data point which has a missing value for an
attribute. When this data point is mapped to the higher-dimensional space, if it
lies within the boundaries of its classifications convex hull, regardless of the
value which could be assigned to the missing attribute, or if it extends the
convex hull in a direction away from the convex hull enclosing the other
classification, the point can be ignored; it cannot be a support vector. If the data
point is mapped in such a way that it extends the convex hull in a direction
toward the other convex hull, it may be significant. If varying the value of the
missing attribute within some range of values moves the point along a line that
is parallel to the closest face of the other convex hull, then it is safe to assign any
of those values to the missing attribute.
The difficulty with this analysis is that the SVM does not actually map the
points to the higher-dimensional space; it evaluates a kernel function that is
chosen such that its result is proportional to the inner product in the higherdimensional space. The values obtained from the kernel function are supplied to
a quadratic programming optimization algorithm to select the support vectors.
In order to handle examples with missing values within the SVM as it is
learning, it might be possible to somehow tag the values and generate a range of
results from the kernel evaluation, then modify the quadratic optimization
algorithm to determine whether the point should be a candidate for a support
vector.
For a detailed introduction to the workings of SVMs, see (Chen, 2003). For
more detailed analysis of SVMs and their applications, see (Cristianini, 2000)
and (Schlkopf, 2002).
CONCLUSIONS

Surprisingly, the K-Nearest Neighbors and SVM imputation techniques did


not perform significantly better than the technique of replacing the missing
values with the attribute mean, at least for these data sets. For other more
realistic data sets, we would expect to possibly see increased accuracy using the
more complicated imputation techniques; however, experiments on a database of
automobile data demonstrated similar results (the mean value and KNN
techniques were slightly better than ignoring attributes with missing values). We
encountered some other problems with using the SVM imputation technique that
prevented a meaningful comparison of results on that data set. One problem was
non-convergence in the SVM optimization algorithm; normalizing the values of
each continuous attribute to the range [0, 1] helped somewhat, but normalizing
to [-1,1] might provide better results.
All in all, more experimentation is probably required to better gauge
whether the additional effort required to use an SVM to impute missing values is
worthwhile. It would also be useful to attempt to find heuristics to characterize
the data that would act as a guide for choosing the most appropriate imputation
method.
It would definitely be worthwhile to pursue research into incorporating the
handling of missing values directly into the SVM learning algorithm. However,
there is no guarantee that this will lead to a computationally feasible algorithm.

6
ACKNOWLEDGEMENTS

The experiments conducted for this research used the SVM Light package,
written by Thorsten Joachims (Joachims, 1999). For information about the
package, and to download, visit http://svmlight.joachims.org/.
REFERENCES
Chen, P.H., C.-J. Lin, and B. Schlkopf, 2003, A Tutorial on Nu-Support Vector Machines,
http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf.
Cristianini, N. and J. Shawe-Taylor, 2000, An Introduction to Support Vector Machines, Cambridge
University Press, 2000.
Joachims, T., 1998, Making Large-Scale SVM Learning Practical, in B. Schlkopf, C. Burges,
and A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning, MIT-Press,
Cambridge, MA, 1999. http://www-ai.cs.uni-dortmund.de/DOKUMENTE/joachims_99a.pdf.
Schlkopf, B. and A. J. Smola, 2002, Learning with Kernels, MIT Press, Cambridge, MA, 2002.

S-ar putea să vă placă și