$\textstyle\sum_i \alpha_i\, k(x_i, x) = \text{constant}.$
Note that if $k(x, y)$ becomes small as $y$ grows farther away from $x$, each term in the sum measures the degree of closeness of the test point $x$ to the corresponding data-base point $x_i$. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points $x$ mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space.
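As a rough illustration of this nearness measure, the sketch below evaluates such a kernel sum with a Gaussian (RBF) kernel; the points, coefficients $\alpha_i$, and bandwidth gamma are made-up illustrative values, not anything prescribed above.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel: decays toward 0 as y moves away from x."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_sum(points, alphas, x, gamma=1.0):
    """Evaluate sum_i alpha_i * k(x_i, x); larger values mean the
    test point x is, on the whole, closer to the weighted data points."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alphas, points))

# Toy data: two base points with equal weights (illustrative values only).
points = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
alphas = [1.0, 1.0]
print(kernel_sum(points, alphas, np.array([0.1, 0.1])))    # near a point: large
print(kernel_sum(points, alphas, np.array([10.0, 10.0])))  # far from both: ~0
```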
When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups, and then map new data to these formed groups. The clustering algorithm which provides an improvement to the support vector machines is called support vector clustering[2] and is often used in industrial applications either when data is not labeled or when only some data is labeled as a preprocessing for a classification pass.

1 Definition

More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function $k(x, y)$ selected to suit the problem.

2 History

The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.[4] The current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.[1]

3 Motivation

Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a $p$-dimensional vector (a list of $p$ numbers), and we want to know whether we can separate such points with a $(p-1)$-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes.
[Figure: candidate separating hyperplanes $H_1$, $H_2$, $H_3$ in the $(X_1, X_2)$ plane. $H_1$ does not separate the classes. $H_2$ does, but only with a small margin. $H_3$ separates them with the maximum margin.]
4 Linear SVM
We are given a training dataset of $n$ points of the form

$(x_1, y_1), \ldots, (x_n, y_n)$

where the $y_i$ are either 1 or $-1$, each indicating the class to which the point $x_i$ belongs. Each $x_i$ is a $p$-dimensional real vector. We want to find the maximum-margin hyperplane that divides the group of points $x_i$ for which $y_i = 1$ from the group of points for which $y_i = -1$, which is defined so that the distance between the hyperplane and the nearest point $x_i$ from either group is maximized.

Any hyperplane can be written as the set of points $x$ satisfying $w \cdot x - b = 0$, where $w$ is the (not necessarily normalized) normal vector to the hyperplane. The parameter $\tfrac{b}{\|w\|}$ determines the offset of the hyperplane from the origin along the normal vector $w$.

4.1 Hard-margin

If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the margin, and the maximum-margin hyperplane is the hyperplane that lies halfway between them. These hyperplanes can be described by the equations

$w \cdot x - b = 1$

and

$w \cdot x - b = -1.$

Geometrically, the distance between these two hyperplanes is $\tfrac{2}{\|w\|}$, so to maximize the distance between the planes we want to minimize $\|w\|$. As we also have to prevent data points from falling into the margin, we add the following constraint: for each $i$ either

$w \cdot x_i - b \ge 1$, if $y_i = 1$,

or

$w \cdot x_i - b \le -1$, if $y_i = -1$.

These constraints state that each data point must lie on the correct side of the margin.

This can be rewritten as:

$y_i (w \cdot x_i - b) \ge 1$, for all $1 \le i \le n$.   (1)

We can put this together to get the optimization problem:

"Minimize $\|w\|$ subject to $y_i (w \cdot x_i - b) \ge 1$, for $i = 1, \ldots, n$."

The $w$ and $b$ that solve this problem determine our classifier, $x \mapsto \operatorname{sgn}(w \cdot x - b)$.

An easy-to-see but important consequence of this geometric description is that the max-margin hyperplane is completely determined by those $x_i$ which lie nearest to it. These $x_i$ are called support vectors.
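As a small illustration of the resulting decision rule, the sketch below applies $x \mapsto \operatorname{sgn}(w \cdot x - b)$ and checks constraint (1); the particular $w$, $b$, and test points are assumed values for demonstration, not derived from any dataset in the text.

```python
import numpy as np

# Assume a solver returned these for some separable dataset (made-up values).
w = np.array([1.0, -1.0])
b = 0.0

def classify(x):
    """Decision rule x -> sgn(w . x - b) from the hard-margin classifier."""
    return np.sign(w @ x - b)

def satisfies_margin(x_i, y_i):
    """Constraint (1): y_i (w . x_i - b) >= 1."""
    return y_i * (w @ x_i - b) >= 1

print(classify(np.array([2.0, 0.0])))             # 1.0
print(satisfies_margin(np.array([2.0, 0.0]), 1))  # True: outside the margin
```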
4.2 Soft-margin

To extend SVM to cases in which the data are not linearly separable, we introduce the hinge loss function, $\max\left(0, 1 - y_i (w \cdot x_i - b)\right)$, and minimize

$\left[\frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (w \cdot x_i - b)\right)\right] + \lambda \|w\|^2,$   (2)

where the parameter $\lambda$ determines the tradeoff between increasing the margin size and ensuring that each $x_i$ lies on the correct side of the margin.
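For concreteness, a minimal sketch evaluating objective (2) for given $w$, $b$, and data; the function is a direct transcription of the formula, with the regularization weight lam standing in for $\lambda$.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, lam):
    """Objective (2): mean hinge loss plus lam * ||w||^2.
    X is an (n, p) array of points, y a vector of +/-1 labels."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return hinge.mean() + lam * np.dot(w, w)
```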
5 Nonlinear classification

[Figure: kernel machine.]

The resulting algorithm is formally similar to the linear case, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space high-dimensional; although the classifier is a hyperplane in the transformed feature space, it may be nonlinear in the original input space.

It is noteworthy that working in a higher-dimensional feature space increases the generalization error of support vector machines, although given enough samples the algorithm still performs well.[7]
6 Computing the SVM classifier

6.1 Primal
Minimizing (2) can be rewritten as a constrained optimization problem with a differentiable objective function in the following way.

For each $i \in \{1, \ldots, n\}$ we introduce the variable $\zeta_i$, and note that $\zeta_i = \max\left(0, 1 - y_i (w \cdot x_i - b)\right)$ if and only if $\zeta_i$ is the smallest nonnegative number satisfying $y_i (w \cdot x_i - b) \ge 1 - \zeta_i$.

Thus we can rewrite the optimization problem as follows:

minimize $\frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \|w\|^2$

subject to $y_i (w \cdot x_i - b) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$, for all $i$.
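Because this primal is a convex program with linear constraints, any off-the-shelf solver can handle it. A minimal sketch using the cvxpy modeling library (a choice of tool assumed here, not one the text prescribes), with the slack variables written as zeta:

```python
import cvxpy as cp
import numpy as np

def train_primal(X, y, lam):
    """Solve: minimize (1/n) sum(zeta_i) + lam * ||w||^2
    subject to y_i (w . x_i - b) >= 1 - zeta_i and zeta_i >= 0."""
    n, p = X.shape
    w, b, zeta = cp.Variable(p), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum(zeta) / n + lam * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w - b) >= 1 - zeta, zeta >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```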
6.2 Dual

By solving for the Lagrangian dual of the above problem, one obtains the simplified problem:

maximize $f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j)\, y_j c_j$

subject to $\sum_{i=1}^{n} c_i y_i = 0$, and $0 \le c_i \le \frac{1}{2n\lambda}$ for all $i$.

This is called the dual problem. Since the dual maximization problem is a quadratic function of the $c_i$ subject to linear constraints, it is efficiently solvable by quadratic programming algorithms. Here the variables $c_i$ are defined such that $w = \sum_{i=1}^{n} c_i y_i x_i$, and the offset $b$ can be recovered by finding an $x_i$ on the margin's boundary and solving $y_i (w \cdot x_i - b) = 1$, i.e. $b = w \cdot x_i - y_i$.
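Assuming a quadratic-programming routine has already produced the dual coefficients $c$, recovering $w$ and $b$ is mechanical. A sketch under that assumption:

```python
import numpy as np

def recover_w_b(c, X, y, lam):
    """w = sum_i c_i y_i x_i; b from any i with 0 < c_i < 1/(2 n lam),
    i.e. a point on the margin boundary, via b = w . x_i - y_i."""
    n = len(y)
    w = (c * y) @ X                      # sum_i c_i y_i x_i
    upper = 1.0 / (2 * n * lam)
    eps = 1e-8
    on_margin = np.where((c > eps) & (c < upper - eps))[0]
    i = on_margin[0]                     # any margin support vector works
    b = w @ X[i] - y[i]
    return w, b
```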
6.3 Kernel trick

Suppose now that we would like to learn a nonlinear classification rule which corresponds to a linear classification rule for the transformed data points $\varphi(x_i)$. Moreover, we are given a kernel function $k$ which satisfies $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$.

We know the classification vector $w$ in the transformed space satisfies

$w = \sum_{i=1}^{n} c_i y_i \varphi(x_i),$

where the $c_i$ are obtained by solving the optimization problem

maximize $f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i\, k(x_i, x_j)\, y_j c_j$

subject to $\sum_{i=1}^{n} c_i y_i = 0$, and $0 \le c_i \le \frac{1}{2n\lambda}$ for all $i$.

The coefficients $c_i$ can be solved for using quadratic programming, as before. Again, we can find some index $i$ such that $0 < c_i < \frac{1}{2n\lambda}$, so that $\varphi(x_i)$ lies on the boundary of the margin in the transformed space, and then solve

$b = w \cdot \varphi(x_i) - y_i = \left[\sum_{k=1}^{n} c_k y_k\, \varphi(x_k) \cdot \varphi(x_i)\right] - y_i = \left[\sum_{k=1}^{n} c_k y_k\, k(x_k, x_i)\right] - y_i.$

Finally, new points can be classified by computing

$z \mapsto \operatorname{sgn}(w \cdot \varphi(z) - b) = \operatorname{sgn}\left(\left[\sum_{i=1}^{n} c_i y_i\, k(x_i, z)\right] - b\right).$
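Note that prediction never forms $w$ or $\varphi$ explicitly; it only evaluates $k$. A sketch of the kernelized bias and decision rule, with an RBF kernel and its gamma as illustrative assumptions:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel, one common choice of k satisfying k(x, y) = phi(x) . phi(y)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_bias(i, X, y, c, kernel=rbf):
    """b = [sum_k c_k y_k k(x_k, x_i)] - y_i for a margin support vector i."""
    return sum(ck * yk * kernel(xk, X[i]) for ck, yk, xk in zip(c, y, X)) - y[i]

def kernel_predict(z, X, y, c, b, kernel=rbf):
    """z -> sgn( [sum_i c_i y_i k(x_i, z)] - b ), using only kernel calls."""
    score = sum(ci * yi * kernel(xi, z) for ci, yi, xi in zip(c, y, X))
    return np.sign(score - b)
```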
6.4 Modern methods

6.4.1 Sub-gradient descent

Sub-gradient descent algorithms for the SVM work directly with the expression

$f(w, b) = \left[\frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i (w \cdot x_i - b)\right)\right] + \lambda \|w\|^2.$

Note that $f$ is a convex function of $w$ and $b$. As such, traditional gradient descent (or SGD) methods can be adapted, where instead of taking a step in the direction of the function's gradient, a step is taken in the direction of a vector selected from the function's sub-gradient. This approach has the advantage that, for certain implementations, the number of iterations does not scale with $n$, the number of data points.[8]
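A minimal stochastic sub-gradient sketch for $f$: at a sampled point, the hinge term contributes to the sub-gradient only when the margin is violated. The step-size schedule and iteration count are illustrative choices (in the spirit of Pegasos-style solvers), not values given in the text.

```python
import numpy as np

def svm_subgradient(X, y, lam, steps=10_000, seed=0):
    """Stochastic sub-gradient descent on
    f(w, b) = mean hinge loss + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for t in range(1, steps + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)              # a common decaying step size
        if y[i] * (X[i] @ w - b) < 1:      # hinge term active at this sample
            w -= eta * (2 * lam * w - y[i] * X[i])
            b -= eta * y[i]                # d/db of active hinge term is +y_i
        else:
            w -= eta * (2 * lam * w)       # only the regularizer contributes
    return w, b
```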
6.4.2 Coordinate descent

Coordinate descent algorithms for the SVM work from the dual problem

maximize $f(c_1 \ldots c_n) = \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i (x_i \cdot x_j)\, y_j c_j$

subject to $\sum_{i=1}^{n} c_i y_i = 0$, and $0 \le c_i \le \frac{1}{2n\lambda}$ for all $i$.
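A sketch of exact single-coordinate maximization of this dual. For simplicity it drops the equality constraint $\sum_i c_i y_i = 0$, which corresponds to an SVM without the bias term $b$ (the simplification used by LIBLINEAR-style dual coordinate descent); honoring that constraint requires updating coordinates in pairs, as in SMO.

```python
import numpy as np

def dual_coordinate_descent(X, y, lam, sweeps=50):
    """Maximize sum_i c_i - 0.5 * c^T Q c with Q_ij = y_i y_j (x_i . x_j),
    subject to 0 <= c_i <= 1/(2 n lam), one coordinate at a time."""
    n = len(y)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                            # Q_ij = y_i y_j x_i . x_j
    C = 1.0 / (2 * n * lam)
    c = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            grad = 1.0 - Q[i] @ c            # df/dc_i
            # Exact maximizer in coordinate i, clipped to the box [0, C].
            c[i] = np.clip(c[i] + grad / Q[i, i], 0.0, C)
    return c
```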
7 Empirical risk minimization

7.1 Risk minimization
In supervised learning, one is given a set of training examples $X_1, \ldots, X_n$ with labels $y_1, \ldots, y_n$, and wishes to predict $y_{n+1}$ given $X_{n+1}$. To do so one forms a hypothesis, $f$, such that $f(X_{n+1})$ is a good approximation of $y_{n+1}$. A good approximation is usually defined with the help of a loss function, $\ell(y, z)$, which characterizes how bad $z$ is as a prediction of $y$. We would then like to choose a hypothesis that minimizes the expected risk:

$\varepsilon(f) = \mathbb{E}\left[\ell\left(y_{n+1}, f(X_{n+1})\right)\right].$

7.2 Regularization and stability

More generally, $R(f)$ can be some measure of the complexity of the hypothesis $f$, so that simpler hypotheses are preferred. For binary classification with $p_x = \Pr(y = 1 \mid x)$, the best possible classifier is

$f(x) = \begin{cases} 1 & \text{if } p_x \ge 1/2 \\ -1 & \text{otherwise.} \end{cases}$
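In practice the expected risk is estimated by an empirical average over samples. A small sketch with a 0-1 loss as the illustrative choice of $\ell$:

```python
import numpy as np

def zero_one_loss(y, z):
    """l(y, z): how bad z is as a prediction of y (here, 1 if wrong)."""
    return float(y != z)

def empirical_risk(f, X, y, loss=zero_one_loss):
    """Average loss of hypothesis f over a sample; an estimate of the risk."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])
```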
8 Properties

8.1 Parameter selection

8.2 Issues

9 Extensions

SVC is a similar method that also builds on kernel functions but is appropriate for unsupervised learning and data-mining. It is considered a fundamental method in data science.

Multiclass SVM. Common methods for reducing the single multiclass problem into multiple binary classification problems include:

• Building binary classifiers which distinguish (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores); see the sketch after this list. For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
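A sketch of the one-versus-all reduction with winner-takes-all prediction, as referenced in the list above. The binary trainer train_binary is a hypothetical helper standing in for any SVM trainer that returns a real-valued scoring function, such as the soft-margin solvers sketched earlier:

```python
import numpy as np

def train_one_vs_all(X, labels, classes, train_binary):
    """Train one binary scorer per class: class k versus the rest."""
    return {k: train_binary(X, np.where(labels == k, 1, -1)) for k in classes}

def predict_one_vs_all(scorers, x):
    """Winner-takes-all: the classifier with the highest output assigns
    the class (outputs must be calibrated to be comparable)."""
    return max(scorers, key=lambda k: scorers[k](x))
```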
Crammer and Singer proposed a multiclass SVM method which casts the multiclass classification problem into a single optimization problem, rather than decomposing it into multiple binary classification problems.[17] See also Lee, Lin and Wahba.[18][19]

9.3 Transductive support vector machines

Formally, a transductive support vector machine is defined by constraints on both the labeled training examples $x_i$ and the test examples $x_j^\star$:

$y_i (w \cdot x_i - b) \ge 1,$

$y_j^\star (w \cdot x_j^\star - b) \ge 1,$

and

$y_j^\star \in \{-1, 1\}.$

Transductive support vector machines were introduced by Vladimir N. Vapnik in 1998.

The model produced by support vector regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction. Another SVM version known as least squares support vector machine (LS-SVM) has been proposed by Suykens and Vandewalle.[22]

Training the original SVR means solving[23]

minimize $\frac{1}{2}\|w\|^2$ subject to $|y_i - \langle w, x_i \rangle - b| \le \varepsilon.$
11 Implementation
12 Applications
13 See also
• Polynomial kernel
• Predictive analytics
• Regularization perspectives on support vector machines
• Relevance vector machine, a probabilistic sparse kernel model identical in functional form to SVM
• Sequential minimal optimization
• Space mapping
• Winnow (algorithm)
14 References
[1] Cortes, C.; Vapnik, V. (1995). "Support-vector networks". Machine Learning 20 (3): 273. doi:10.1007/BF00994018.

[2] Ben-Hur, Asa; Horn, David; Siegelmann, Hava; and Vapnik, Vladimir (2001). "Support vector clustering". Journal of Machine Learning Research 2: 125–137.
[3]
[14] Hsu, Chih-Wei; and Lin, Chih-Jen (2002). "A Comparison of Methods for Multiclass Support Vector Machines". IEEE Transactions on Neural Networks.

[30] Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. (2008). "LIBLINEAR: A library for large linear classification". Journal of Machine Learning Research 9: 1871–1874.

[31] Allen Zhu, Zeyuan; et al. (2009). "P-packSVM: Parallel Primal grAdient desCent Kernel SVM" (PDF). ICDM.

[32] Vapnik, V.: Invited Speaker. IPMU Information Processing and Management of Uncertainty in Knowledge-Based Systems (2014).

[33] Barghout, Lauren (2015). "Spatial-Taxon Information Granules as Used in Iterative Fuzzy-Decision-Making for Image Segmentation". Granular Computing and Decision-Making. Springer International Publishing. 285–318.
15 Bibliography

• Theodoridis, Sergios; and Koutroumbas, Konstantinos; Pattern Recognition, 4th Edition, Academic Press, 2009, ISBN 978-1-59749-272-0

• Cristianini, Nello; and Shawe-Taylor, John; An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000. ISBN 0-521-78019-5 (SVM Book)

• Kecman, Vojislav; Learning and Soft Computing: Support Vector Machines, Neural Networks, Fuzzy Logic Systems, The MIT Press, Cambridge, MA, 2001.

• Schölkopf, Bernhard; and Smola, Alexander J.; Learning with Kernels, MIT Press, Cambridge, MA, 2002. ISBN 0-262-19475-9

• Schölkopf, Bernhard; Burges, Christopher J. C.; and Smola, Alexander J. (editors); Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999. ISBN 0-262-19416-3

• Shawe-Taylor, John; and Cristianini, Nello; Kernel Methods for Pattern Analysis, Cambridge University Press, 2004. ISBN 0-521-81397-2 (Kernel Methods Book)

• Steinwart, Ingo; and Christmann, Andreas; Support Vector Machines, Springer-Verlag, New York, 2008. ISBN 978-0-387-77241-7 (SVM Book)

• Tan, Peter Jing; and Dowe, David L. (2004); MML Inference of Oblique Decision Trees, Lecture Notes in Artificial Intelligence (LNAI) 3339, Springer-Verlag, pp. 1082–1088. (This paper uses minimum message length (MML) and actually incorporates probabilistic support vector machines in the leaves of decision trees.)
16 External links
• www.support-vector.net: the key book about the method, An Introduction to Support Vector Machines, with online software

• Burges, Christopher J. C.; "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2: 121–167, 1998

• www.kernel-machines.org (general information and collection of research papers)

• www.support-vector-machines.org (literature, review, software, and links related to support vector machines; academic site)

• Fradkin, Dmitriy; and Muchnik, Ilya; "Support Vector Machines for Classification", in Abello, J.; and Carmode, G. (eds); Discrete Methods in Epidemiology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 70, pp. 13–20, 2006. Succinctly describes theoretical ideas behind SVM.

• Bennett, Kristin P.; and Campbell, Colin; "Support Vector Machines: Hype or Hallelujah?", SIGKDD Explorations, 2 (2), 2000, 1–13. Excellent introduction to SVMs with helpful figures.

• Ivanciuc, Ovidiu; "Applications of Support Vector Machines in Chemistry", in Reviews in Computational Chemistry, Volume 23, 2007, pp. 291–400. Reprint available:

• Karatzoglou, Alexandros; et al.; "Support Vector Machines in R", Journal of Statistical Software, April 2006, Volume 15, Issue 9.

• liblinear: a library for large linear classification including some SVMs

• Shark: a C++ machine learning library implementing various types of SVMs

• dlib: a C++ library for working with kernel methods and SVMs

• SVM light: a collection of software tools for learning and classification using SVM

• SVMJS live demo: a GUI demo for a JavaScript implementation of SVMs

• Gesture Recognition Toolkit: contains an easy-to-use wrapper for libsvm