$\textstyle\sum_i \alpha_i\, k(x_i, x) = \text{constant}.$
Note that if $k(x, y)$ becomes small as $y$ grows farther away from $x$, each term in the sum measures the degree of closeness of the test point $x$ to the corresponding data-base point $x_i$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points $x$ mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space.
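As a rough illustration of this nearness measure, the sketch below evaluates such a kernel sum with a Gaussian (RBF) kernel; the points, coefficients $\alpha_i$, and bandwidth gamma are made-up illustrative values, not anything prescribed above.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel: decays toward 0 as y moves away from x."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_sum(points, alphas, x, gamma=1.0):
    """Evaluate sum_i alpha_i * k(x_i, x); larger values mean the
    test point x is, on the whole, closer to the weighted data points."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alphas, points))

# Toy data: two base points with equal weights (illustrative values only).
points = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
alphas = [1.0, 1.0]
print(kernel_sum(points, alphas, np.array([0.1, 0.1])))    # near a point: large
print(kernel_sum(points, alphas, np.array([10.0, 10.0])))  # far from both: ~0
```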
When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups, and then map new data to these formed groups. The clustering algorithm which provides an improvement to the support vector machines is called support vector clustering[2] and is often used in industrial applications either when data is not labeled or when only some data is labeled as a preprocessing for a classification pass.

1 Definition

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function $k(x, y)$ selected to suit the problem.

2 History

The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.[4] The current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.[1]

3 Motivation

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a $p$-dimensional vector (a list of $p$ numbers), and we want to know whether we can separate such points with a $(p-1)$-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes.
[Figure: candidate separating hyperplanes $H_1$, $H_2$, $H_3$ in the $(X_1, X_2)$ plane. $H_1$ does not separate the classes. $H_2$ does, but only with a small margin. $H_3$ separates them with the maximum margin.]
4 Linear SVM
We are given a training dataset of $n$ points of the form

$(x_1, y_1), \ldots, (x_n, y_n)$

where the $y_i$ are either 1 or $-1$, each indicating the class to which the point $x_i$ belongs. Each $x_i$ is a $p$-dimensional real vector. We want to find the maximum-margin hyperplane that divides the group of points $x_i$ for which $y_i = 1$ from the group of points for which $y_i = -1$, which is defined so that the distance between the hyperplane and the nearest point $x_i$ from either group is maximized.

Any hyperplane can be written as the set of points $x$ satisfying $w \cdot x - b = 0$, where $w$ is the (not necessarily normalized) normal vector to the hyperplane. The parameter $\tfrac{b}{\|w\|}$ determines the offset of the hyperplane from the origin along the normal vector $w$.

4.1 Hard-margin

If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the margin, and the maximum-margin hyperplane is the hyperplane that lies halfway between them. These hyperplanes can be described by the equations

$w \cdot x - b = 1$

and

$w \cdot x - b = -1.$

Geometrically, the distance between these two hyperplanes is $\tfrac{2}{\|w\|}$, so to maximize the distance between the planes we want to minimize $\|w\|$. As we also have to prevent data points from falling into the margin, we add the following constraint: for each $i$ either

$w \cdot x_i - b \ge 1$, if $y_i = 1$,

or

$w \cdot x_i - b \le -1$, if $y_i = -1$.

These constraints state that each data point must lie on the correct side of the margin.

This can be rewritten as:

$y_i (w \cdot x_i - b) \ge 1$, for all $1 \le i \le n$.   (1)

We can put this together to get the optimization problem:

"Minimize $\|w\|$ subject to $y_i (w \cdot x_i - b) \ge 1$, for $i = 1, \ldots, n$."

The $w$ and $b$ that solve this problem determine our classifier, $x \mapsto \operatorname{sgn}(w \cdot x - b)$.

An easy-to-see but important consequence of this geometric description is that the max-margin hyperplane is completely determined by those $x_i$ which lie nearest to it. These $x_i$ are called support vectors.
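As a small illustration of the resulting decision rule, the sketch below applies $x \mapsto \operatorname{sgn}(w \cdot x - b)$ and checks constraint (1); the particular $w$, $b$, and test points are assumed values for demonstration, not derived from any dataset in the text.

```python
import numpy as np

# Assume a solver returned these for some separable dataset (made-up values).
w = np.array([1.0, -1.0])
b = 0.0

def classify(x):
    """Decision rule x -> sgn(w . x - b) from the hard-margin classifier."""
    return np.sign(w @ x - b)

def satisfies_margin(x_i, y_i):
    """Constraint (1): y_i (w . x_i - b) >= 1."""
    return y_i * (w @ x_i - b) >= 1

print(classify(np.array([2.0, 0.0])))             # 1.0
print(satisfies_margin(np.array([2.0, 0.0]), 1))  # True: outside the margin
```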
4.2 Soft-margin

To extend SVM to cases in which the data are not linearly separable, we introduce the hinge loss function, $\max\left(0, 1 - y_i (w \cdot x_i - b)\right)$, and minimize

$\left[\frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (w \cdot x_i - b)\right)\right] + \lambda \|w\|^2,$   (2)

where the parameter $\lambda$ determines the tradeoff between increasing the margin size and ensuring that each $x_i$ lies on the correct side of the margin.
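For concreteness, a minimal sketch evaluating objective (2) for given $w$, $b$, and data; the function is a direct transcription of the formula, with the regularization weight lam standing in for $\lambda$.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, lam):
    """Objective (2): mean hinge loss plus lam * ||w||^2.
    X is an (n, p) array of points, y a vector of +/-1 labels."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return hinge.mean() + lam * np.dot(w, w)
```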
5 Nonlinear classification

[Figure: kernel machine.]

The resulting algorithm is formally similar to the linear case, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; although the classifier is a hyperplane in the transformed feature space, it may be nonlinear in the original input space.

It is noteworthy that working in a higher-dimensional feature space increases the generalization error of support vector machines, although given enough samples the algorithm still performs well.[7]
6 Computing the SVM classifier

6.1 Primal
Minimizing (2) can be rewritten as a constrained optimization problem with a differentiable objective function in the following way.

For each $i \in \{1, \ldots, n\}$ we introduce the variable $\zeta_i$, and note that $\zeta_i = \max\left(0, 1 - y_i (w \cdot x_i - b)\right)$ if and only if $\zeta_i$ is the smallest nonnegative number satisfying $y_i (w \cdot x_i - b) \ge 1 - \zeta_i$.

Thus we can rewrite the optimization problem as follows:

minimize $\frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|w\|^2$

subject to $y_i (w \cdot x_i - b) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$, for all $i$.
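Because this primal is a convex program with linear constraints, any off-the-shelf solver can handle it. A minimal sketch using the cvxpy modeling library (a choice of tool assumed here, not one the text prescribes), with the slack variables written as zeta:

```python
import cvxpy as cp
import numpy as np

def train_primal(X, y, lam):
    """Solve: minimize (1/n) sum(zeta_i) + lam * ||w||^2
    subject to y_i (w . x_i - b) >= 1 - zeta_i and zeta_i >= 0."""
    n, p = X.shape
    w, b, zeta = cp.Variable(p), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum(zeta) / n + lam * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w - b) >= 1 - zeta, zeta >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```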
6.2 Dual

By solving for the Lagrangian dual of the above problem, one obtains the simplified problem:

maximize $f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j)\, y_j c_j$

subject to $\sum_{i=1}^{n} c_i y_i = 0$, and $0 \le c_i \le \frac{1}{2n\lambda}$ for all $i$.

This is called the dual problem. Since the dual maximization problem is a quadratic function of the $c_i$ subject to linear constraints, it is efficiently solvable by quadratic programming algorithms. Here the variables $c_i$ are defined such that $w = \sum_{i=1}^{n} c_i y_i x_i$, and the offset $b$ can be recovered by finding an $x_i$ on the margin's boundary and solving $y_i (w \cdot x_i - b) = 1$, i.e. $b = w \cdot x_i - y_i$.
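Assuming a quadratic-programming routine has already produced the dual coefficients $c$, recovering $w$ and $b$ is mechanical. A sketch under that assumption:

```python
import numpy as np

def recover_w_b(c, X, y, lam):
    """w = sum_i c_i y_i x_i; b from any i with 0 < c_i < 1/(2 n lam),
    i.e. a point on the margin boundary, via b = w . x_i - y_i."""
    n = len(y)
    w = (c * y) @ X                      # sum_i c_i y_i x_i
    upper = 1.0 / (2 * n * lam)
    eps = 1e-8
    on_margin = np.where((c > eps) & (c < upper - eps))[0]
    i = on_margin[0]                     # any margin support vector works
    b = w @ X[i] - y[i]
    return w, b
```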
6.3 Kernel trick

Suppose now that we would like to learn a nonlinear classification rule which corresponds to a linear classification rule for the transformed data points $\varphi(x_i)$. Moreover, we are given a kernel function $k$ which satisfies $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$.

We know the classification vector $w$ in the transformed space satisfies

$w = \sum_{i=1}^{n} c_i y_i \varphi(x_i),$

where the $c_i$ are obtained by solving the optimization problem

maximize $f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i\, k(x_i, x_j)\, y_j c_j$

subject to $\sum_{i=1}^{n} c_i y_i = 0$, and $0 \le c_i \le \frac{1}{2n\lambda}$ for all $i$.

The coefficients $c_i$ can be solved for using quadratic programming, as before. Again, we can find some index $i$ such that $0 < c_i < \frac{1}{2n\lambda}$, so that $\varphi(x_i)$ lies on the boundary of the margin in the transformed space, and then solve

$b = w \cdot \varphi(x_i) - y_i = \left[\sum_{k=1}^{n} c_k y_k\, \varphi(x_k) \cdot \varphi(x_i)\right] - y_i = \left[\sum_{k=1}^{n} c_k y_k\, k(x_k, x_i)\right] - y_i.$

Finally, new points can be classified by computing

$z \mapsto \operatorname{sgn}(w \cdot \varphi(z) - b) = \operatorname{sgn}\left(\left[\sum_{i=1}^{n} c_i y_i\, k(x_i, z)\right] - b\right).$
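Note that prediction never forms $w$ or $\varphi$ explicitly; it only evaluates $k$. A sketch of the kernelized bias and decision rule, with an RBF kernel and its gamma as illustrative assumptions:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel, one common choice of k satisfying k(x, y) = phi(x) . phi(y)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_bias(i, X, y, c, kernel=rbf):
    """b = [sum_k c_k y_k k(x_k, x_i)] - y_i for a margin support vector i."""
    return sum(ck * yk * kernel(xk, X[i]) for ck, yk, xk in zip(c, y, X)) - y[i]

def kernel_predict(z, X, y, c, b, kernel=rbf):
    """z -> sgn( [sum_i c_i y_i k(x_i, z)] - b ), using only kernel calls."""
    score = sum(ci * yi * kernel(xi, z) for ci, yi, xi in zip(c, y, X))
    return np.sign(score - b)
```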
6.4 Modern methods

6.4.1 Sub-gradient descent

Sub-gradient descent algorithms for the SVM work directly with the expression

$f(w, b) = \left[\frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (w \cdot x_i - b)\right)\right] + \lambda \|w\|^2.$

Note that $f$ is a convex function of $w$ and $b$. As such, traditional gradient descent (or SGD) methods can be adapted, where instead of taking a step in the direction of the function's gradient, a step is taken in the direction of a vector selected from the function's sub-gradient. This approach has the advantage that, for certain implementations, the number of iterations does not scale with $n$, the number of data points.[8]
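A minimal stochastic sub-gradient sketch for $f$: at a sampled point, the hinge term contributes to the sub-gradient only when the margin is violated. The step-size schedule and iteration count are illustrative choices (in the spirit of Pegasos-style solvers), not values given in the text.

```python
import numpy as np

def svm_subgradient(X, y, lam, steps=10_000, seed=0):
    """Stochastic sub-gradient descent on
    f(w, b) = mean hinge loss + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for t in range(1, steps + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)              # a common decaying step size
        if y[i] * (X[i] @ w - b) < 1:      # hinge term active at this sample
            w -= eta * (2 * lam * w - y[i] * X[i])
            b -= eta * y[i]                # d/db of active hinge term is +y_i
        else:
            w -= eta * (2 * lam * w)       # only the regularizer contributes
    return w, b
```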
6.4.2 Coordinate descent

Coordinate descent algorithms for the SVM work from the dual problem

maximize $f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j)\, y_j c_j$

subject to $\sum_{i=1}^{n} c_i y_i = 0$, and $0 \le c_i \le \frac{1}{2n\lambda}$ for all $i$.
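A sketch of exact single-coordinate maximization of this dual. For simplicity it drops the equality constraint $\sum_i c_i y_i = 0$, which corresponds to an SVM without the bias term $b$ (the simplification used by LIBLINEAR-style dual coordinate descent); honoring that constraint requires updating coordinates in pairs, as in SMO.

```python
import numpy as np

def dual_coordinate_descent(X, y, lam, sweeps=50):
    """Maximize sum_i c_i - 0.5 * c^T Q c with Q_ij = y_i y_j (x_i . x_j),
    subject to 0 <= c_i <= 1/(2 n lam), one coordinate at a time."""
    n = len(y)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                            # Q_ij = y_i y_j x_i . x_j
    C = 1.0 / (2 * n * lam)
    c = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            grad = 1.0 - Q[i] @ c            # df/dc_i
            # Exact maximizer in coordinate i, clipped to the box [0, C].
            c[i] = np.clip(c[i] + grad / Q[i, i], 0.0, C)
    return c
```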
7 Empirical risk minimization

7.1 Risk minimization
In supervised learning, one is given a set of training examples $X_1, \ldots, X_n$ with labels $y_1, \ldots, y_n$, and wishes to predict $y_{n+1}$ given $X_{n+1}$. To do so one forms a hypothesis, $f$, such that $f(X_{n+1})$ is a good approximation of $y_{n+1}$. A good approximation is usually defined with the help of a loss function, $\ell(y, z)$, which characterizes how bad $z$ is as a prediction of $y$. We would then like to choose a hypothesis that minimizes the expected risk:

$\varepsilon(f) = \mathbb{E}\left[\ell\left(y_{n+1}, f(X_{n+1})\right)\right].$

7.2 Regularization and stability

More generally, $R(f)$ can be some measure of the complexity of the hypothesis $f$, so that simpler hypotheses are preferred. For binary classification with $p_x = \Pr(y = 1 \mid x)$, the best possible classifier is

$f(x) = \begin{cases} 1 & \text{if } p_x \ge 1/2 \\ -1 & \text{otherwise.} \end{cases}$
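In practice the expected risk is estimated by an empirical average over samples. A small sketch with a 0-1 loss as the illustrative choice of $\ell$:

```python
import numpy as np

def zero_one_loss(y, z):
    """l(y, z): how bad z is as a prediction of y (here, 1 if wrong)."""
    return float(y != z)

def empirical_risk(f, X, y, loss=zero_one_loss):
    """Average loss of hypothesis f over a sample; an estimate of the risk."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])
```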
8 Properties

8.1 Parameter selection

8.2 Issues

9 Extensions

SVC is a similar method that also builds on kernel functions but is appropriate for unsupervised learning and data-mining. It is considered a fundamental method in data science.

Multiclass SVM. Common methods for reducing the single multiclass problem into multiple binary classification problems include:

• Building binary classifiers which distinguish (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores); see the sketch after this list. For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
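A sketch of the one-versus-all reduction with winner-takes-all prediction, as referenced in the list above. The binary trainer train_binary is a hypothetical helper standing in for any SVM trainer that returns a real-valued scoring function, such as the soft-margin solvers sketched earlier:

```python
import numpy as np

def train_one_vs_all(X, labels, classes, train_binary):
    """Train one binary scorer per class: class k versus the rest."""
    return {k: train_binary(X, np.where(labels == k, 1, -1)) for k in classes}

def predict_one_vs_all(scorers, x):
    """Winner-takes-all: the classifier with the highest output assigns
    the class (outputs must be calibrated to be comparable)."""
    return max(scorers, key=lambda k: scorers[k](x))
```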
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[17] See also Lee, Lin and Wahba.[18][19]

9.3 Transductive support vector machines

Formally, a transductive support vector machine is defined by constraints on both the labeled training examples $x_i$ and the test examples $x_j^\star$:

$y_i (w \cdot x_i - b) \ge 1,$

$y_j^\star (w \cdot x_j^\star - b) \ge 1,$

and

$y_j^\star \in \{-1, 1\}.$

Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.

The model produced by support vector regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction. Another SVM version known as least squares support vector machine (LS-SVM) has been proposed by Suykens and Vandewalle.[22]

Training the original SVR means solving[23]

minimize $\frac{1}{2}\|w\|^2$ subject to $|y_i - \langle w, x_i \rangle - b| \le \varepsilon.$
11 Implementation
12 Applications
13 See also
• Polynomial kernel
• Predictive analytics
• Regularization perspectives on support vector machines
• Relevance vector machine, a probabilistic sparse kernel model identical in functional form to SVM
• Sequential minimal optimization
• Space mapping
• Winnow (algorithm)
14 References
[1] Cortes, C.; Vapnik, V. (1995). "Support-vector networks". Machine Learning 20 (3): 273. doi:10.1007/BF00994018.

[2] Ben-Hur, Asa; Horn, David; Siegelmann, Hava; and Vapnik, Vladimir (2001). "Support vector clustering". Journal of Machine Learning Research 2: 125–137.
[3]
[14] Hsu, Chih-Wei; and Lin, Chih-Jen (2002). "A Comparison of Methods for Multiclass Support Vector Machines". IEEE Transactions on Neural Networks.

[30] Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. (2008). "LIBLINEAR: A library for large linear classification". Journal of Machine Learning Research 9: 1871–1874.

[31] Allen Zhu, Zeyuan; et al. (2009). "P-packSVM: Parallel Primal grAdient desCent Kernel SVM" (PDF). ICDM.

[32] Vapnik, V.: Invited Speaker. IPMU Information Processing and Management of Uncertainty in Knowledge-Based Systems (2014).

[33] Barghout, Lauren (2015). "Spatial-Taxon Information Granules as Used in Iterative Fuzzy-Decision-Making for Image Segmentation". Granular Computing and Decision-Making. Springer International Publishing. 285–318.
15 Bibliography

• Theodoridis, Sergios; and Koutroumbas, Konstantinos; Pattern Recognition, 4th Edition, Academic Press, 2009, ISBN 978-1-59749-272-0

• Cristianini, Nello; and Shawe-Taylor, John; An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000. ISBN 0-521-78019-5 (SVM Book)

• Kecman, Vojislav; Learning and Soft Computing: Support Vector Machines, Neural Networks, Fuzzy Logic Systems, The MIT Press, Cambridge, MA, 2001.

• Schölkopf, Bernhard; and Smola, Alexander J.; Learning with Kernels, MIT Press, Cambridge, MA, 2002. ISBN 0-262-19475-9

• Schölkopf, Bernhard; Burges, Christopher J. C.; and Smola, Alexander J. (editors); Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999. ISBN 0-262-19416-3

• Shawe-Taylor, John; and Cristianini, Nello; Kernel Methods for Pattern Analysis, Cambridge University Press, 2004. ISBN 0-521-81397-2 (Kernel Methods Book)

• Steinwart, Ingo; and Christmann, Andreas; Support Vector Machines, Springer-Verlag, New York, 2008. ISBN 978-0-387-77241-7 (SVM Book)

• Tan, Peter Jing; and Dowe, David L. (2004); MML Inference of Oblique Decision Trees, Lecture Notes in Artificial Intelligence (LNAI) 3339, Springer-Verlag, pp. 1082–1088. (This paper uses minimum message length (MML) and actually incorporates probabilistic support vector machines in the leaves of decision trees.)
16 External links
• www.support-vector.net: the key book about the method, An Introduction to Support Vector Machines, with online software

• Burges, Christopher J. C.; "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2: 121–167, 1998

• www.kernel-machines.org (general information and collection of research papers)

• www.support-vector-machines.org (literature, review, software, and links related to support vector machines; academic site)

• Fradkin, Dmitriy; and Muchnik, Ilya; "Support Vector Machines for Classification", in Abello, J.; and Carmode, G. (eds); Discrete Methods in Epidemiology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 70, pp. 13–20, 2006. Succinctly describes theoretical ideas behind SVM.

• Bennett, Kristin P.; and Campbell, Colin; "Support Vector Machines: Hype or Hallelujah?", SIGKDD Explorations, 2 (2), 2000, 1–13. Excellent introduction to SVMs with helpful figures.

• Ivanciuc, Ovidiu; "Applications of Support Vector Machines in Chemistry", in Reviews in Computational Chemistry, Volume 23, 2007, pp. 291–400. Reprint available:

• Karatzoglou, Alexandros; et al.; "Support Vector Machines in R", Journal of Statistical Software, April 2006, Volume 15, Issue 9.

• liblinear: a library for large linear classification including some SVMs

• Shark: a C++ machine learning library implementing various types of SVMs

• dlib: a C++ library for working with kernel methods and SVMs

• SVM light: a collection of software tools for learning and classification using SVM

• SVMJS live demo: a GUI demo for a JavaScript implementation of SVMs

• Gesture Recognition Toolkit: contains an easy-to-use wrapper for libsvm