
Support Vector Machine

Advanced Methods for Data Analysis-2018

Ujjwal Das

Indian Institute of Management Udaipur, Rj-313001

September 11, 2019

Outline

The classification problem

Hyperplane; use of a hyperplane as a separating tool

Separable data; the maximal margin classifier and its mathematical formulation

Non-separable data; the support vector classifier and its mathematical formulation

Classification with non-linear decision boundaries; the SVM with kernels

SVM for more than two classes (will be discussed entirely via examples in the R language)

The classification problem

Consider a simple example. Suppose we have a population consisting of men and women. We draw a random sample from the population and come up with some rules that will guide us in classifying the gender of every member of the population.

So, using this set of rules, we will be able to classify the entire population into two groups.

The rules depend on some characteristics of the collected data. For our example, suppose there are two such characteristics: the height and the hair length of the subjects. Let us plot hair length against height to visualize a separable data set.

A real example of data suffering from "separation": a problem for logistic regression.

A separable data set

[Figure: scatterplot of hair length (y-axis) against height (x-axis), titled "Height vs Hair Color", showing two clearly separable groups]

Hyperplane

In a p-dimensional space, a hyperplane is a flat affine subspace of dimension (p − 1)

What is an affine subspace?

The mathematical definition of a hyperplane in a p-dimensional space is

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = 0 \qquad (1)$$

where x = (x1, x2, . . . , xp) is a point in p-dimensional space

If p = 2 then (1) reduces to a one-dimensional subspace (a straight line), and for p = 3 it is a plane.

So, if we find a point x∗ = (x∗1, x∗2, . . . , x∗p) that satisfies (1), then we say that the point x∗ lies on the hyperplane.

Naturally, this is not always the case.
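As a quick illustration, here is a minimal R sketch with made-up coefficients and points (p = 2 assumed): it evaluates f(x) = β0 + β1x1 + β2x2 and uses the sign of the result to decide where the point lies relative to the hyperplane.

# A minimal sketch, assuming p = 2 and arbitrary illustrative coefficients
beta0 <- -1          # intercept of the hyperplane
beta  <- c(2, 3)     # (beta1, beta2)

f <- function(x) beta0 + sum(beta * x)   # f(x) = beta0 + beta1*x1 + beta2*x2

f(c(2, -1))   # =  0      -> the point lies on the hyperplane
f(c(1,  1))   # =  4 > 0  -> the point lies on one side
f(c(0, -1))   # = -4 < 0  -> the point lies on the other side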

Separating Hyperplane

Let x̃ = (x̃1, x̃2, . . . , x̃p) be a point in p-dimensional space that does not satisfy (1); rather, the left-hand side of (1) is > 0 at x̃. This tells us that the point x̃ lies to one side of the hyperplane.

Similarly, if it is < 0, then x̃ will be on the other side of the hyperplane.

Thus, one can think of the hyperplane as separating the p-dimensional space into two halves.

For separable training data one can easily construct such a hyperplane and thereafter use it to classify test and future data

Suppose we label yi = 1 when f(xi) = β0 + β1xi1 + β2xi2 + · · · + βpxip > 0, and yi = −1 when β0 + β1xi1 + β2xi2 + · · · + βpxip < 0, for the i = 1, 2, . . . , n subjects in the training data.

Separating Hyperplanes

Combining these, we find the defining property of a separating hyperplane: yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) > 0 for every i = 1, 2, . . . , n

So, the simplest classifier is based on f(x): for a test observation x, if f(x) > 0 we assign it to class +1, and if f(x) < 0 it is assigned to class −1

The magnitude of f(x) is also important. A large value indicates that the point is far from the hyperplane, and hence the user can be confident about the classification. Similarly, a value of f(x) close to zero means that x is located near the hyperplane, and one may be less certain about the class assignment.

Now, a separating hyperplane can be shifted a little upward or downward, or rotated, to create another separating hyperplane without coming into contact with any observations.

Separating Hyperplanes

[Figure: the hair length vs. height scatterplot again ("Height vs Hair Color"), used to illustrate that many separating hyperplanes are possible for the separable data]

The maximal margin classifier

Since there are infinitely many possible separating hyperplanes, we need a reasonable way to decide which of them to use

This necessity leads to the maximal margin classifier: the separating hyperplane that is farthest from the training data points

For a given hyperplane, one can compute its distance from each of the training observations. The smallest of these distances is called the margin; it is the minimum distance of the hyperplane from the observations.

The maximal margin hyperplane is the one that has the largest margin; alternatively, the maximal margin hyperplane has the farthest minimum distance to the training data points

We can then classify a test observation based on which side of the maximal margin hyperplane it lies

The coefficients β = (β0, β1, . . . , βp) are obtained by solving an optimization problem
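To make this concrete, here is a minimal illustrative R sketch (simulated data, not the lecture's example) using the e1071 package: with a linear kernel and a very large cost value, svm() leaves essentially no room for margin violations, so the fit approximates the maximal margin classifier on separable data.

# A minimal sketch: approximating the maximal margin classifier with e1071::svm
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 3          # shift one class so the data are separable
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
fit$index                                # indices of the support vectors
plot(fit, dat)                           # decision boundary with support vectors marked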

A separable data set and the maximal margin classifier

Figure: Margin and maximal margin hyperplane for a separable data set, plotted in the (X1, X2) plane


The maximal margin classifier: mathematical formulation

Mathematically speaking

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} \; M \qquad (2)$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1,$$

$$y_i\,(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \text{for every } i = 1, 2, \ldots, n$$
The constraint yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) ≥ M for all i = 1, 2, . . . , n, together with M > 0, ensures that each training data point is on the correct side of the hyperplane

M is called the margin of the hyperplane; it is the quantity maximized to obtain β
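A brief aside on why the normalization Σβj² = 1 matters: the perpendicular distance from a point xi to the hyperplane is

$$\operatorname{dist}(x_i) = \frac{\left|\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right|}{\sqrt{\sum_{j=1}^{p} \beta_j^2}},$$

so under the constraint Σβj² = 1 the quantity yi(β0 + β1xi1 + · · · + βpxip) is exactly the signed distance of xi from the hyperplane, and the constraint ≥ M requires every training point to lie at least M away from the hyperplane, on its correct side.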

The pros and cons of MMH

In the previous figure, there are three training points that are equidistant from the hyperplane and lie along the margin. The points on the margin are called support vectors, since they support the maximal margin hyperplane. A small change in the support vectors may dramatically change the maximal margin hyperplane

In other words, the maximal margin hyperplane depends heavily on the support vectors, but not on the movement of the other observations as long as they stay outside the boundary set by the margin

Exact separation of the training data may lead to over-fitting when the margin is small, and can create a sensitivity issue. Additionally, a tiny margin is problematic in the sense that the distance of an observation from the hyperplane can be thought of as a measure of confidence that the observation was correctly classified

Finally, for non-separable data the maximal margin hyperplane does not exist

Non-separable data

For non-separable data, the above-mentioned optimization problem does not have any solution with M > 0

Here one may use a hyperplane with slight flexibility: instead of exactly separating all the training observations, we now almost separate them, using a so-called soft margin

The extension of the maximal margin classifier to non-separable data is known as the support vector classifier

It is also called the soft margin classifier, since it allows some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane

In fact, this is inevitable when there is no separating hyperplane
A non-separable data set

Figure: A non-separable data set, plotted in the (X1, X2) plane, for which a maximal margin hyperplane cannot be constructed
Support vector classifier

As mentioned before, the previous optimization problem does not have any solution for non-separable data

Here, we have a slightly different mathematical formulation

$$\max_{\beta_0, \beta_1, \ldots, \beta_p,\, \epsilon_1, \epsilon_2, \ldots, \epsilon_n} \; M \qquad (3)$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1,$$

$$y_i\,(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i),$$

$$\sum_{i=1}^{n} \epsilon_i \le C,$$

where the εi ≥ 0 are slack variables and C is a non-negative tuning parameter, for all i = 1, 2, . . . , n
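A minimal R sketch (simulated data, illustrative values only) of fitting a support vector classifier with e1071. One caveat about the software, stated as an assumption to check: svm()'s cost argument penalizes margin violations, so it is inversely related to the budget C in (3); a small cost corresponds to a large budget (wide margin, many violations), and vice versa.

library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1           # classes overlap: non-separable data
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
summary(fit)                              # reports the number of support vectors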

Support vector classifier

As before, M is the margin of the hyperplane, which is maximized to obtain β

For the maximal margin hyperplane, the constraint yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) ≥ M for all i = 1, 2, ..., n, with M > 0, ensures that each training data point is on the correct side of the hyperplane

Here, the relaxed constraint yi f(xi) ≥ M(1 − εi) allows some of the training observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane

εi = 0 indicates that the ith observation is on the correct side of the margin; εi > 0 indicates that the ith observation has violated the margin; and finally εi > 1 means it is on the wrong side of the hyperplane

C is the tuning parameter, playing the role of a bias-variance trade-off

Support vector classifier

The tuning parameter plays a crucial role in determining the number and severity of margin (and/or hyperplane) violations

C can be thought of as a budget for the amount by which the margin can be violated by the n observations

What is the meaning of C = 0 and of C > 0?

It also ensures that no more than C observations can be on the wrong side of the hyperplane, since each such observation has εi > 1 and ε1 + ε2 + · · · + εn ≤ C

So, a high value of C indicates that more violations of the margin are tolerated, which will widen the resulting margin

Following is an example in which the SVC was applied to a small non-separable data set

A support vector classifier

Figure: Support vector classifier, with margin and hyperplane, fitted to twelve labelled observations in the (X1, X2) plane for two different choices of C (two panels)
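A comparison like the one in this figure can be reproduced (on simulated data, since the slide's data set is not given) by refitting the classifier at two cost values and using e1071's plot method for svm objects.

library(e1071)

set.seed(10)
x <- matrix(rnorm(24), ncol = 2)                 # twelve observations, two features
y <- c(rep(-1, 6), rep(1, 6))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit_loose <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
fit_tight <- svm(y ~ ., data = dat, kernel = "linear", cost = 100, scale = FALSE)
plot(fit_loose, dat)   # small cost: many support vectors, wide effective margin
plot(fit_tight, dat)   # large cost: fewer support vectors, narrow effective margin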

Support vector classifier on the same data for different choices of the tuning parameter
Figure: Support vector classifier fitted to the same data for different values of C (multiple panels in the (X1, X2) plane)

Support vector classifier

A small value of C leads to a narrower margin with few margin violations

Hence, the selection of C involves a bias-variance trade-off: what does this mean?

In practice, the tuning parameter C is chosen via cross-validation (a sketch in R follows at the end of this slide)

It can be shown that only the observations that either lie on the margin or violate the margin have an impact on the hyperplane, and hence on the SVC

Changing the position of the other observations has no impact on the classifier, as long as they remain on the correct side of the margin

Here, the observations (vectors) that lie on the margin or violate the margin are known as support vectors

We see that the decision boundary resulting from the SVC depends only on a subset of the training data, which means it is robust to the behavior of observations that are far from the margin. This is a distinctive feature of this classifier
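As noted above, the tuning parameter is chosen by cross-validation in practice; a minimal sketch using e1071's tune() on simulated data (the candidate cost values are illustrative assumptions):

library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

cv <- tune(svm, y ~ ., data = dat, kernel = "linear",
           ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(cv)                 # 10-fold cross-validation error for each cost value
best <- cv$best.model       # the classifier refitted at the best cost
summary(best)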

Non-linear decision boundary: Support vector machine

The SVC is a natural approach for classification in the two-class setting when the boundary between the two classes is approximately linear

This is not always the case in real life, where non-linear decision boundaries are not rare

Recall the similar discussion in multiple regression, where one or more predictors may influence the continuous response in a non-linear fashion

Following is a visual example in which a non-linear decision boundary is inevitable

A support vector classifier on data having a non-linear decision boundary
Figure: Two-class data with a non-linear decision boundary, and a support vector classifier fitted to them (two panels in the (X1, X2) plane)
Support vector machine
To address this non-linearity of the data, we need to incorporate some non-linear functions of the features in the optimization

With quadratic terms, the optimization becomes

$$\max_{\beta_0,\, \beta_{11}, \ldots, \beta_{p1},\, \beta_{12}, \ldots, \beta_{p2},\, \epsilon_1, \ldots, \epsilon_n} \; M \qquad (4)$$

subject to

$$\sum_{j=1}^{p} \sum_{k=1}^{2} \beta_{jk}^2 = 1,$$

$$y_i \Bigl(\beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2 \Bigr) \ge M(1 - \epsilon_i),$$

$$\sum_{i=1}^{n} \epsilon_i \le C,$$

where the εi ≥ 0 are slack variables and C is a non-negative tuning parameter, for all i = 1, 2, . . . , n
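One way to see what formulation (4) is doing, as a hedged sketch on simulated data rather than the lecture's example: explicitly add squared features and fit an ordinary linear support vector classifier on the enlarged feature set. The boundary is linear in (x1, x2, x1², x2²) but non-linear in the original (x1, x2).

library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- as.factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular class boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  x1sq = x[, 1]^2, x2sq = x[, 2]^2, y = y)

# Linear SVC on the enlarged feature set (x1, x2, x1^2, x2^2)
fit <- svm(y ~ x1 + x2 + x1sq + x2sq, data = dat, kernel = "linear", cost = 1)
mean(predict(fit, dat) == dat$y)    # training accuracy of the enlarged-feature fit

The kernel approach on the next slide achieves the same kind of enlargement implicitly, and far more efficiently.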
Support vector machine

The quadratic terms involved in the optimization result in a non-linear decision boundary, since the solution set of a quadratic equation is, in general, non-linear

One may certainly also involve higher-order polynomials, as well as two-factor interactions of the form Xj Xk with j ≠ k

Several other functions of the predictors are also allowed

The SVM is an extension of the SVC that results from enlarging the feature space in a specific way, using kernels K(xi, xi′)

Different functional forms of the kernel yield different types of SVM; e.g.,

$$K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij}\, x_{i'j}\Bigr)^{d}$$

is called the polynomial kernel with degree d; here d ≥ 1 is an integer

Another popular choice is the radial kernel, given by

$$K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr) \qquad (5)$$
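A hedged R sketch of the two kernels above, again on simulated data: e1071's svm() exposes them through kernel = "polynomial" (with degree) and kernel = "radial" (with gamma); the specific parameter values here are illustrative.

library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- as.factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # non-linear boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit_poly   <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 3, cost = 1)
fit_radial <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)

plot(fit_poly, dat)     # non-linear decision boundary from the polynomial kernel
plot(fit_radial, dat)   # non-linear decision boundary from the radial kernel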

Support vector machine with a polynomial kernel of degree d = 3 and with a radial kernel
Figure: Support vector machine with a polynomial kernel (left panel) and a radial kernel (right panel), plotted in the (X1, X2) plane
