SVM and CO-TRAINING
Vincent WU
July 2002
Abstract
This report concerns SVM and co-training. It demonstrates experimental results on a set of public datasets obtained with SVM toolboxes and some code from Dr. Li. It also includes explanations of the experimental results, as well as some underlying theoretical insights on SVM and co-training, either personally interpreted or learned from the literature and from Dr. Li.
Contents
1 Introduction to SVM
1.1 Linear Support Vector Machine
1.2 Lagrangian Optimization Theory
1.3 Imperfect separation
1.4 Decision function
1.5 Quadratic Programming Problem (QP Problem)
1.6 Nonlinear SVM
2 SVM experiments
2.1 The MATLAB skeleton for SVM
2.2 OSU SVM Toolbox
2.3 Investigate CRAB data - Binary Classifier
2.4 Investigate IRIS data - Multi-class case
3 Co-training
3.1 Multi-class formulation and query optimization for SVMs through a computational geometry approach
1 Introduction to SVM
This introduction to Support Vector Machines is based on [4], [13], [1], [10], [8].
1.1 Linear Support Vector Machine
Given a training set $\{x_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{-1, +1\}$, $x_i \in \mathbb{R}^d$, a separating hyperplane $w \cdot x + b = 0$ classifies the training set such that:

$$w \cdot x_i + b \geq +1 \quad \text{if } y_i = +1 \qquad (1)$$
$$w \cdot x_i + b \leq -1 \quad \text{if } y_i = -1 \qquad (2)$$
The geometric interpretation of (1) & (2) is that the positive and negative training points closest to the separating hyperplane lie exactly on the two hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$ respectively (we can call these two hyperplanes the boundaries). (Note that the closest points can always be placed on these two hyperplanes by rescaling $w$ and $b$, as long as the training set is linearly separable.) The separating margin is thus $\frac{|+1|}{\|w\|} + \frac{|-1|}{\|w\|} = \frac{2}{\|w\|}$. So the goal of finding the separating hyperplane with the largest margin can be achieved by minimizing $\|w\|$.
Inequalities (1) & (2) can be combined into one set of inequalities:

$$y_i (w \cdot x_i + b) - 1 \geq 0 \quad \forall i \qquad (3)$$

The goal now is:

$$\text{minimize } \frac{1}{2}\|w\|^2 = \frac{1}{2}\langle w, w \rangle \quad \text{subject to (3)} \qquad (4)$$
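As a small worked illustration of (3)-(4) (an added toy example, not from the original experiments): take two one-dimensional points $x_1 = -1$, $y_1 = -1$ and $x_2 = +1$, $y_2 = +1$. Then

$$y_1(wx_1 + b) \geq 1 \Rightarrow w - b \geq 1, \qquad y_2(wx_2 + b) \geq 1 \Rightarrow w + b \geq 1,$$
$$\Rightarrow\; w \geq 1 + |b|, \quad \text{so } \min \tfrac{1}{2}w^2 \text{ is attained at } w^* = 1,\; b^* = 0,\; \text{margin} = \tfrac{2}{\|w^*\|} = 2,$$

which is exactly the gap between the two points.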
1.2 Lagrangian Optimization Theory
Introducing Lagrange multipliers $\alpha_1, \alpha_2, \ldots, \alpha_l \geq 0$, we can construct the following Lagrangian function:

$$L(w, b, \alpha) = \frac{1}{2}\langle w, w \rangle - \sum_{i=1}^{l} \alpha_i \big( y_i(\langle w, x_i \rangle + b) - 1 \big), \qquad (5)$$

$L$ has to be minimized with respect to the primal variables $w$ and $b$ and maximized w.r.t. the dual variables $\alpha$. That is,

$$(w^*, b^*, \alpha^*) = \arg\min_{(w,b)} \arg\max_{(\alpha \geq 0)} L(w, b, \alpha).$$
At the extremum, minimizing $L$ w.r.t. $w$ results in

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w^* = \sum_{i=1}^{l} \alpha_i y_i x_i \qquad (6)$$

Similarly, minimizing $L$ w.r.t. $b$ results in

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (7)$$
by using the Lagrange theorem:

Theorem 1.1 (Lagrange (Equality Constraints)). Given an objective function $f(w)$ and equality constraints $h_i(w) = 0$, $i = 1, \ldots, n$, augment the objective function to form the Lagrangian function $L(w, \beta) \equiv f(w) + \sum_{i=1}^{n} \beta_i h_i(w)$. The necessary conditions for a normal point $w^*$ to be a minimum of $f(w)$ are

$$\frac{\partial}{\partial w} L(w^*, \beta^*) = 0 \qquad (8)$$
$$\frac{\partial}{\partial \beta} L(w^*, \beta^*) = 0 \qquad (9)$$
Now, substituting (6) and (7) into (5) gives

$$L(w^*, b, \alpha) = -\frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_i \alpha_i \equiv \Theta(\alpha) \qquad (10)$$

And then the goal becomes a Wolfe dual instead, which can be solved as a Quadratic Programming problem:

maximize (10) w.r.t. $\alpha$,
subject to the constraints (7) and $\alpha \geq 0$.
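For completeness, the substitution step behind (10) can be spelled out (a standard derivation added here; it is not in the original text):

$$L(w^*, b, \alpha) = \frac{1}{2}\Big\langle \sum_i \alpha_i y_i x_i, \sum_j \alpha_j y_j x_j \Big\rangle - \sum_i \alpha_i y_i \Big\langle \sum_j \alpha_j y_j x_j, x_i \Big\rangle - b \sum_i \alpha_i y_i + \sum_i \alpha_i$$
$$= \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i,$$

where the term $b \sum_i \alpha_i y_i$ vanishes by (7).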
In this report, we will not discuss how to obtain $\alpha$ by tackling the QP problem. Instead, we just discuss the properties of $\alpha$ using the KKT conditions.
Theorem 1.2 (Karush-Kuhn-Tucker (KKT)). Given an optimization problem:

minimize $f(w)$
subject to $g_j(w) \leq 0$, $j = 1, \ldots, m$
and $h_i(w) = 0$, $i = 1, \ldots, n$.

Necessary and sufficient conditions for a normal point $w^*$ to be an optimum are the existence of coefficients $\alpha^* \in \mathbb{R}^m$, $\beta^* \in \mathbb{R}^n$ such that, with

$$L(w, \alpha, \beta) = f(w) + \sum_i \beta_i h_i(w) + \sum_j \alpha_j g_j(w), \qquad (11)$$

$$\frac{\partial}{\partial w} L(w^*, \alpha^*, \beta^*) = 0 \qquad (12)$$
$$\frac{\partial}{\partial \beta} L(w^*, \alpha^*, \beta^*) = 0 \qquad (13)$$

and, for all $j = 1, \ldots, m$,

$$\alpha_j g_j(w^*) = 0, \qquad (14)$$
$$g_j(w^*) \leq 0, \quad \alpha_j \geq 0. \qquad (15)$$
Equation (5) can be formulated as a KKT problem:

$$L(w, b, \alpha) = \frac{1}{2}\langle w, w \rangle + \sum_{i=1}^{n} \alpha_i \big(1 - y_i(\langle w, x_i \rangle + b)\big) + \sum_{j=1}^{m} \alpha_j \big(1 - y_j(\langle w, x_j \rangle + b)\big),$$

where $n$ is the number of data points which satisfy

$$1 - y_i(\langle w, x_i \rangle + b) = 0,$$

and $m$ is the number of data points which satisfy

$$1 - y_j(\langle w, x_j \rangle + b) < 0;$$

so for all $j = 1, \ldots, m$,

$$\alpha_j g_j(w^*) = 0, \quad g_j(w^*) < 0, \quad \alpha_j = 0.$$

The physical meaning of the KKT conditions for our problem is that the correctly classified points lying outside the boundaries have corresponding $\alpha_i = 0$. It means those points have no influence on the formation of the separating hyperplane.
Some words can be added to explain why $\alpha_i = 0$ for those points outside the boundaries which meet the criterion $1 - y_i(w \cdot x_i + b) < 0$: to achieve the maximum of (5) w.r.t. the variable $\alpha$, we should eliminate any decrease in value caused by the term $1 - y_i(w \cdot x_i + b)$ by introducing the zero multiplier $\alpha_i$. Similarly, if $1 - y_i(w \cdot x_i + b) > 0$, the $\alpha_i$ would be driven to positive infinity to enlarge the value of (5) w.r.t. $\alpha$ as much as possible. But such an infinite value makes the original goal meaningless, so the case of crossing the boundaries is prohibited if we strictly require that no points fall between the two boundaries. However, any point exactly on the margin ($1 - y_i(w \cdot x_i + b) = 0$) can be associated with a positive $\alpha_i$. Those points on the boundaries are then taken as Support Vectors.
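In summary, for the separable case the complementarity condition $\alpha_i \big(y_i(w \cdot x_i + b) - 1\big) = 0$ leaves only two possibilities for each training point:

$$y_i(w \cdot x_i + b) > 1 \;\Rightarrow\; \alpha_i = 0 \quad \text{(point outside the boundaries, no influence)},$$
$$y_i(w \cdot x_i + b) = 1 \;\Rightarrow\; \alpha_i \geq 0 \quad \text{(point on a boundary, a potential support vector)}.$$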
1.3 Imperfect separation
Even if the training points are linearly separable, strictly requiring all training points to lie on or outside the boundaries might make the margin extremely small, which will severely degrade the generalization performance. So we sometimes only require the training points to meet the following criteria:

$$x_i \cdot w + b \geq +1 - \xi_i \quad \text{for } y_i = +1 \qquad (16)$$
$$x_i \cdot w + b \leq -1 + \xi_i \quad \text{for } y_i = -1 \qquad (17)$$
$$\xi_i \geq 0 \quad \forall i.$$

So we would like to add a penalizing term to the objective function:

$$\text{minimize}_{w, b, \xi} \quad \frac{1}{2}\langle w, w \rangle + C \Big( \sum_{i=1}^{l} \xi_i \Big) \qquad (18)$$
Besides (18) above, various ways to incorporate the penalty into the objective function have been proposed.

We then construct a new Lagrangian

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\langle w, w \rangle + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\alpha_i \big[ y_i(\langle w, x_i \rangle + b) - (1 - \xi_i) \big] - \sum_{i=1}^{l}\mu_i \xi_i, \quad \mu_i \geq 0.$$

There is an additional stationarity condition

$$\frac{\partial}{\partial \xi_i} L(w, b, \xi, \alpha, \mu) = 0 \;\Rightarrow\; \alpha_i = C - \mu_i,$$

which together with $\mu_i \geq 0$ and $\alpha_i \geq 0$ gives $0 \leq \alpha_i \leq C$.

Now the dual $\Theta(\alpha)$ is still (10), due to the elimination of $\xi_i$ and $\mu_i$ during the substitution. The only difference is that $\alpha_i$ is now bounded by $[0, C]$. Also, the $\alpha_i$ associated with training points enclosed between the two boundaries are positive too, so those training points are also regarded as Support Vectors.
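Putting the pieces together, the soft-margin dual problem therefore reads (restated here for convenience):

$$\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C.$$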
1.4 Decision function
So the decision function for a testing point $x$ is

$$\operatorname{sgn}\Big( \sum_{i=1}^{l} y_i \alpha_i \, x_i \cdot x + b \Big),$$

and

$$w^* = \sum_{i=1}^{N_s} \alpha_i y_i x_i,$$

where $N_s$ is the number of support vectors.

The bias $b$ can be determined from any point $x_j$ which lies exactly on the boundaries. For example, by combining

$$w^* = \sum_{i=1}^{l} \alpha_i y_i x_i \quad \text{and} \quad y_j(w \cdot x_j + b) = 1,$$

we get

$$b = y_j - w \cdot x_j = y_j - \Big\langle \sum_{i=1}^{l} \alpha_i y_i x_i, \, x_j \Big\rangle = y_j - \sum_{i=1}^{l} \alpha_i y_i \langle x_j, x_i \rangle.$$

And it is numerically safer to take the mean/median value of $b$ resulting from all such equations.
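Explicitly, writing $M$ for the index set of points lying exactly on the boundaries, the averaged bias is

$$b = \frac{1}{|M|} \sum_{j \in M} \Big( y_j - \sum_{i=1}^{l} \alpha_i y_i \langle x_j, x_i \rangle \Big),$$

which is what the MATLAB skeleton in Section 2.1 computes (using the median instead of the mean).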
1.5 Quadratic Programming Problem (QP Problem)
to be added
1.6 Nonlinear SVM
To be exact, there are actually two ways to handle nonlinear separability. The first one is to allow imperfect separation: when $\xi_i \geq 1$, the separating hyperplane puts the training point $x_i$ on the wrong side at a cost of $C\xi_i$. But this is a very weak solution to nonlinear separability and is only used after the following second method is adopted: mapping the data points $x_i$ to another high-dimensional space such that the new points $\Phi(x_i)$ are (nearly) linearly separable.

Recall that (6) tells us that $w$ can be expressed as a linear combination of the training points, so the decision function for a newly added testing point $x$ only depends on the inner products between $x$ and the $x_i$. So after the mapping, the SVM only depends on $\Phi(x_i) \cdot \Phi(x_j)$ too. Supposing we have $\ker(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, then no explicit information about $\Phi(\cdot)$ is needed, as long as we know that the kernel function $\ker(x_i, x_j)$ is equivalent to the dot product in some other high-dimensional space.
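As a minimal MATLAB sketch (added for illustration; X is assumed to be an L x d matrix of training vectors, and p and gamma are the kernel parameters), the two kernels used in the later experiments can be computed as whole matrices:

% Polynomial kernel of order p: k(x1,x2) = (1 + <x1,x2>)^p
K_poly = (1 + X*X').^p;
% RBF kernel: k(x1,x2) = exp(-gamma*||x1-x2||^2)
[L,d] = size(X);
sq = sum(X.^2,2);                 % L x 1 vector of squared norms
K_rbf = exp(-gamma*(repmat(sq,1,L) + repmat(sq',L,1) - 2*(X*X')));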
2 SVM experiments
2.1 The MATLAB skeleton for SVM
Here is a simple MATLAB prototyping program to train an SVM:
function [alpha,b,minimumQuadraticProgramingValue, ...
          supportVectors,vectorsOnMargin,margin] = trainsvm(x,y)
C = 1e+10;
% The larger C, the more penalty is given to data points falling
% between the two margins. By default, a large (practically infinite)
% value is used to achieve a perfect separation.
% The set of data points has dimension L x d
[L,d] = size(x);
% The label set has dimension L x 1
L = size(y,1);
% The kernel function operates on all pairs of input vectors;
% a 2nd-order polynomial kernel is used by default, so dim(K) = L x L
K = (1 + x*x').^2;
% Use the quadratic programming solver in MATLAB
[alpha,minimumQuadraticProgramingValue,exitflag] = ...
    quadprog(diag(y)*K*diag(y),-ones(L,1), ...
             [],[],y',0,zeros(L,1),C*ones(L,1));
alphaY = alpha.*y;
% Find the support vectors and the vectors exactly on the margin;
% allow for some numerical inaccuracy
tol = 0.001*max(alpha);
% Support vectors have alpha in (0+tol, C]
supportVectors = find(alpha >= tol);
% Vectors exactly on the margin have alpha in (0+tol, C-tol)
vectorsOnMargin = find(alpha >= tol & alpha <= C-tol);
b = median(y(vectorsOnMargin) - K(vectorsOnMargin,:)*alphaY);
% margin = 2/||w|| where ||w||^2 = w'*w and w = sum_i alpha_i*y_i*x_i,
% so ||w||^2 = alphaY'*K*alphaY
margin = 2/sqrt(alphaY'*K*alphaY);
% To assign a label to a new point newx (1 x d), outside this function:
%   newy = sign(((1 + newx*x').^2)*alphaY + b);
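A possible way to exercise this skeleton on a toy two-class problem (purely illustrative; the data below is made up and is not one of the report's datasets):

% two Gaussian clouds in 2-D with labels +1/-1
randn('state',0);
x = [randn(20,2)+2; randn(20,2)-2];
y = [ones(20,1); -ones(20,1)];
[alpha,b] = trainsvm(x,y);
% classify a new point with the same 2nd-order polynomial kernel
newx = [1.5 1.5];
newy = sign(((1 + newx*x').^2)*(alpha.*y) + b);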
The SVM requires approximately $cL^2$ time to train, where $L$ is the number of training examples and $c$ is an algorithm-dependent constant. The SVM also requires $N_s$ kernel computations to query the class of a newly added testing point.
There are many methods to reduce the computational complexity of SVMs. SMO (Sequential Minimal Optimization for SVM) avoids the use of $b$ when doing the KKT condition check, and works by optimizing two $\alpha_i$'s at a time while fixing the others. [6] reduces the computation for classifying query examples by varying the form of the decision function.

Some experimental schemes have also been proposed to reduce the computation while keeping nearly the same accuracy. [12] concatenates the data points in each class from $\frac{l}{m}$ on average down to $(\frac{l}{m})^{0.5}$, where $m$ is the number of classes in a multi-class task, so the computation is reduced to $cml$. [3] discards the features $x_{\cdot j}$ associated with small weight values $w_j$ after obtaining an SVM based on a small portion of the training vectors. To be rigorous, I think [3] should discard the features with a small $w_j x_{\cdot j}$, rather than deciding solely on the magnitude of $w_j$.
2.2 OSU SVM Toolbox
The OSU SVM toolbox (http://eewww.eng.ohio-state.edu/~maj/osu_svm/) is a MATLAB SVM toolbox based on the C++ package LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). It retains the high efficiency of LIBSVM while at the same time offering the convenience of MATLAB.

Both OSU SVM ver2.0 and ver3.0 were used when conducting the series of experiments. Some slight differences were observed: ver2.0 has more bugs and uses the one-against-all strategy for multi-class classification, while ver3.0 uses the one-against-one method.

Kernel  Parameter  Rates        Margin
poly    d = 1      0.87 (0.95)  0.0088
poly    d = 2      0.88 (0.95)  0.0115
poly    d = 3      0.85 (0.97)  0.0084
poly    d = 4      0.84 (0.98)  0.0110
RBF     d = 10     0.84 (0.98)  0.0106
RBF     d = 1      0.86 (0.96)  0.0076
RBF     d = 0.1    0.86 (0.93)  0.0063
RBF     d = 0.01   0.88 (0.93)  0.0044

Table 1: CRAB data, OSU SVM 2.0
A standard way to use the OSU SVM toolbox ver2.0 is as follows:

[AlphaY,SupportVectors,b,Parameters,Ns] = ...
    PolySVC(x,y,orderP,Gamma,Coeff,C);
% where the format of the input is: dim(x) = [d,l]; dim(y) = [1,l].
% The polynomial kernel function is
%   k(x1,x2) = (Coeff + Gamma*<x1,x2>)^orderP.
% When orderP = 1, the kernel is linear.
% or
[AlphaY,SupportVectors,b,Parameters,Ns] = RbfSVC(x,y,Gamma,C);
% The RBF kernel function is k(x1,x2) = exp(-Gamma*|x1-x2|^2).
% Gamma = Coeff = C = 1 by default.
% The OSU SVM toolbox does not provide a routine to calculate
% the margin. A line of code is added to fulfill the task
% (k(Svc,Svc) is the kernel matrix evaluated on the support vectors):
margin = 1/sqrt(AlphaY*k(Svc,Svc)*AlphaY');
2.3 Investigate CRAB data - Binary Classifier
The CRAB data can be classified into two classes according to sex. First we start with the CRAB data set by initially selecting only two features of each CRAB data vector. The linear, polynomial and RBF kernels are used. We then train on the CRAB data with all five features.
2.4 Investigate IRIS data - Multi-class case
The IRIS data set is a multi-class data set: 150 data vectors with 4 features each, grouped into three classes. We use the first 25 vectors of each class as training data and the remaining 25 of each class as test data, and set C = 5000.
Kernel  Parameter  Rates            Margin
poly    d = 1      0.9467 (0.9733)  0.1069
poly    d = 2      0.9333 (1.0000)  0.3280
poly    d = 3      0.9333 (1.0000)  4.7778
poly    d = 4      0.9200 (1.0000)  56.8478
poly    d = 5      0.9200 (1.0000)  56.8478
poly    d = 6      0.8800 (1.0000)  6596.1
RBF     r = 10     0.9333 (1.0000)  243.3521
RBF     r = 1      0.9467 (1.0000)  1.9121
RBF     r = 0.1    0.9467 (1.0000)  0.1786
RBF     r = 0.01   0.9467 (1.0000)  0.0094

Table 2: IRIS classification by OSU SVM 2.0

Kernel  Parameter  Rates            Margin
poly    d = 1      0.9333 (1.0000)  NaN
poly    d = 2      0.9333 (1.0000)  NaN
poly    d = 3      0.9333 (1.0000)  NaN
poly    d = 4      0.9200 (1.0000)  NaN
poly    d = 5      0.9200 (1.0000)  NaN
poly    d = 6      0.9200 (1.0000)  NaN
RBF     r = 10     0.9333 (1.0000)  NaN
RBF     r = 1      0.9467 (1.0000)  NaN
RBF     r = 0.1    0.9333 (1.0000)  NaN
RBF     r = 0.01   0.9333 (1.0000)  NaN

Table 3: IRIS classification by OSU SVM 3.0
Note that the OSU SVM toolbox ver2.0 uses the one-against-all approach, in which n SVM models (n is the total number of classes) are constructed; the i-th SVM is trained with all of the examples of the i-th class as positive labels and all other examples as negative labels. So after the complete SVM training, an n x N matrix (N is the total number of training vectors) holding the decision value of each SVM model is returned. The computation is up to $cN^3$.

Another method for multi-class classification is the one-against-one approach, in which n(n-1)/2 classifiers are constructed and each one is trained on data from two different classes out of the n classes. Although more classifiers are constructed, the training data set learned by each classifier is smaller (only data from two classes), so the computational complexity can be kept at the lower level of $\frac{n(n-1)}{2} \cdot c\left(\frac{2N}{n}\right)^2 \approx 2cN^2$. The next question is whether the query time for classifying a new point becomes much longer or not.

But the final decision making for labelling is more complicated. A voting strategy is used: the points of any class i take part in n-1 of the n(n-1)/2 pairwise classifiers, and each pairwise classifier designates a test point to one of its two classes (either label i or the label of the paired class). In the end, each point is assigned to the class with the maximum number of votes; a sketch of this voting step is given below.

After the classification job is finished, the set of hyperplanes is piecewise combined into a general classifier for future use. Thus the generalization measure based on the margin is not as clear as in the two-category case.
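A minimal MATLAB sketch of the voting step (illustrative only; it assumes the pairwise decision values have already been computed and stored in the hypothetical variables decisions and pairs):

% decisions: T x nPairs matrix of pairwise outputs in {+1,-1} for T test points
% pairs:     nPairs x 2 matrix, pairs(k,:) = [i j] for the k-th classifier
T = size(decisions,1);
n = max(pairs(:));
votes = zeros(T,n);
for k = 1:size(pairs,1)
    % the k-th classifier votes for class i on +1 outputs, class j otherwise
    winner = pairs(k,1)*(decisions(:,k) > 0) + pairs(k,2)*(decisions(:,k) <= 0);
    for t = 1:T
        votes(t,winner(t)) = votes(t,winner(t)) + 1;
    end
end
[maxvotes,predicted] = max(votes,[],2);   % predicted class = most votes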
ECOC (Error-Correcting Output Coding) [5], whose decoding step is identical to that of ECC (error-correcting codes), was proposed by (Dietterich & Bakiri, 1995) to give a unifying framework for multiclass classification. The steps can be summarized as follows: for a group of training data from m classes, create an m x n ECOC matrix whose entries take values in {+1, 0, -1}. For each column, we choose the training points of the classes whose ECOC matrix value is +1 as positive training samples and those with -1 as negative training samples; those with value zero are not used in the training. After all the columns are used, n binary classifiers are formed, each corresponding to the training data partition ruled by one column. Any testing point is labelled by the n classifiers and a code of length n is returned, so we can assign the testing point to the class whose corresponding ECOC row vector is most similar to the predicted code. Both the one-vs-one and one-vs-all schemes can then be described by ECOC matrices. Other matrices such as BCH codes are also used [11]. The matrix design is itself a research area which can be connected to Experimental Design.
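A minimal MATLAB sketch of the ECOC decoding step just described (illustrative; the code matrix M and the per-column predictions preds are assumed given):

% M:     m x n ECOC matrix with entries in {+1,0,-1}
% preds: T x n matrix of binary predictions in {+1,-1} for T test points
T = size(preds,1);
label = zeros(T,1);
for t = 1:T
    % Hamming-style distance to each row of the code matrix;
    % zero entries of M are ignored (they took no part in training)
    dist = sum((M ~= 0) .* (M ~= repmat(preds(t,:),size(M,1),1)), 2);
    [mindist,label(t)] = min(dist);
end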
3 Co-training

Co-training by combining two classifiers, where each classifier operates on the training vectors with the full feature set, was introduced by [9]; it differs from the version that splits the features into two parts and then trains two classifiers on different feature views [2]. Besides using two or more classifiers in parallel, co-training is mainly characterized by its use of unlabelled data. That is, one classifier labels a set of unlabelled data which is then used by the other classifier, and vice versa.

It makes sense that co-training operating on two redundant views will work. But the first variant, which simply uses two classifiers, has received some criticism regarding its assumptions and experimental results [7]. In my implementation, using two SVM classifiers with different kernels, it was experimentally verified that the performance is not so good. But an interesting topic arises: choosing the most useful/confident/informative unlabelled data for co-training.

Given several training samples, two SVMs, one with a linear and one with a 2nd-order polynomial kernel, are trained. The two constructed hyperplanes are each supported by a set of support vectors. In fact, those support vectors are more likely to be misclassified than the vectors outside the margin; in other words, they are more informative. Furthermore, those data points which are regarded as support vectors by both SVMs are more informative for constructing the final classifier. Also, among those more informative support vectors, the vectors labelled differently by the two SVMs are most likely to be useful for finalizing the combined classifier.
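A rough MATLAB sketch of the disagreement part of this selection heuristic (purely illustrative: trainsvm_lin and trainsvm_poly are hypothetical linear- and polynomial-kernel variants of the Section 2.1 skeleton, and xL, yL, xU stand for the labelled data and the unlabelled pool):

% train the two SVMs on the labelled data
[alpha1,b1] = trainsvm_lin(xL,yL);    % linear kernel (assumed variant)
[alpha2,b2] = trainsvm_poly(xL,yL);   % 2nd-order polynomial kernel (assumed variant)
% label the unlabelled pool with each SVM
y1 = sign((xU*xL')*(alpha1.*yL) + b1);
y2 = sign(((1 + xU*xL').^2)*(alpha2.*yL) + b2);
% the candidates regarded above as most informative:
% points on which the two SVMs disagree
candidates = find(y1 ~= y2);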
3.1 Multi-class formulation and query optimization for SVMs through a computational geometry approach

Multiclass SVM = prototyping a hyperplane arrangement + fine-tuning the prototype arrangement by SVMs.

An ECOC code proposes an arrangement of n hyperplanes and then chooses m convex regions to hold the training data. By spanning some neighbouring regions (corresponding to the distance measure on the codes), testing points which do not fall in the m regions can still be classified.

Is the query for the class of a testing point the point location (or half-space range search) problem? Is there an algorithm with linear space O(n) and logarithmic query time?

Introducing multiple SVM binary classifiers into the feature space can actually be regarded as the classical hyperplane arrangement problem in Computational Geometry.

Arrangement of hyperplanes: let N be a finite set of hyperplanes in $\mathbb{R}^d$. The arrangement H(N) formed by N is the natural partition of $\mathbb{R}^d$ into many convex regions.

So any ECOC code matrix is equivalent to a scheme of hyperplane arrangement. Each hyperplane constructed by an SVM separates the feature space into two halves labelled with either +1 or -1. The n ordered hyperplanes give any convex region a {+1, -1} code of length n. Such a code might correspond to a row of the ECOC code matrix.

So the ECOC approach can be given a geometrical interpretation. Once we specify a code matrix of size m x n, a prototype hyperplane arrangement is proposed. Among all the convex regions, m regions are expected to hold the training data.

The ECOC matrix can be arbitrarily defined, since it is introduced before we import any training data to learn the SVMs. For example, we can at first assume the hyperplanes are mutually orthogonal; that corresponds to the one-vs-all code matrix. We then train n SVMs to make the hyperplane arrangement match the distribution of the training data. That is, each SVM fine-tunes a hyperplane by translation and rotation. After the training stage, the region for any class $m_i$ is distorted into a fixed convex region.

The query for the class of a testing point can be finished in two steps. First, we check the alignment of the testing point with each of the n hyperplanes. Then we assign the testing point to a category based on the code calculated in the first step. Note that because $m \leq 2^n$, many codes might not match the code of a class $m_i$ exactly; the distance concept is then used to find the closest match. The Hamming distance compares each bit position; geometrically, we unify some neighbouring convex regions.

The query can actually be regarded as the point location (or range search) problem in computational geometry. But in higher dimensions, known point location methods do not achieve both linear space and logarithmic query time. As in our SVM multi-class query case, n hyperplanes are constructed and $m$ ($m \leq 2^n$) convex union regions are specified, and each query has to check against all hyperplanes, taking O(n).
References

[1] B. Schölkopf and A. Smola. Tutorial for the book: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.

[2] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1998.

[3] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. Interaction of feature selection methods and linear classification models, 2002.

[4] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

[5] R. Ghani. Using error-correcting codes with co-training for text classification with a large number of categories.

[6] D. DeCoste. Anytime interval-valued outputs for kernel machines: Fast support vector machine classification via distance geometry.

[7] F. Denis and R. Gilleron. ECML/PKDD 2001 tutorial: Co-training and learning from labeled and unlabeled data.

[8] X. Ge. A C++ implementation of John C. Platt's Sequential Minimal Optimization (SMO) for training an SVM.

[9] S. Goldman and Y. Zhou. Enhancing supervised learning with unlabeled data. In Proc. 17th International Conf. on Machine Learning, pages 327-334. Morgan Kaufmann, San Francisco, CA, 2000.

[10] T. Hofmann. CS 295-3: Machine Learning and Pattern Recognition, 2001.

[11] J. Rennie and R. Rifkin. Improving multi-class text classification with the support vector machine, 2001.

[12] L. Shih, Y.-H. Chang, J. Rennie, and D. Karger. Not too hot, not too cold: the bundled-SVM is just right, 2002.

[13] T. Jaakkola. 6.867 Machine Learning, coursework 3, 2001.