
Support Vector Machine

Advanced Methods for Data Analysis-2018

Ujjwal Das

Indian Institute of Management Udaipur, Rj-313001

September 11, 2019

Outline

The classification problem

Hyperplane; use of a hyperplane as a separating tool

Separable data; the maximal margin classifier and its mathematical formulation

Non-separable data; the support vector classifier and its mathematical formulation

Classification with non-linear decision boundaries; the SVM with kernels

SVM for more than two classes (will be discussed entirely via examples in the R language)

The classification problem

Consider a simple example. Suppose we have a population consisting of men and women. We draw a random sample from the population and come up with some rules that will guide us in classifying the gender of every member of the population.

So, using this set of rules, we will be able to classify the entire population into two groups.

The rules depend on some characteristics of the collected data. For our example, suppose there are two such characteristics: the height and the hair length of the subjects. Let us plot hair length against height to visualize a separable data set.

A real example of data suffering from "separation": a problem for logistic regression.

A separable data set

[Figure: scatterplot of hair length (y-axis) against height (x-axis), titled "Height vs Hair Color", showing two clearly separable groups]

Hyperplane

In a p-dimensional space, a hyperplane is a flat affine subspace of dimension (p − 1)

What is an affine subspace?

The mathematical definition of a hyperplane in a p-dimensional space is

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = 0 \qquad (1)$$

where x = (x1, x2, . . . , xp) is a point in p-dimensional space

If p = 2 then (1) reduces to a one-dimensional subspace (a straight line), and for p = 3 it is a plane.

So, if we find a point x∗ = (x∗1, x∗2, . . . , x∗p) that satisfies (1), then we say that the point x∗ lies on the hyperplane.

Naturally, this is not always the case.
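As a quick illustration, here is a minimal R sketch with made-up coefficients and points (p = 2 assumed): it evaluates f(x) = β0 + β1x1 + β2x2 and uses the sign of the result to decide where the point lies relative to the hyperplane.

# A minimal sketch, assuming p = 2 and arbitrary illustrative coefficients
beta0 <- -1          # intercept of the hyperplane
beta  <- c(2, 3)     # (beta1, beta2)

f <- function(x) beta0 + sum(beta * x)   # f(x) = beta0 + beta1*x1 + beta2*x2

f(c(2, -1))   # =  0      -> the point lies on the hyperplane
f(c(1,  1))   # =  4 > 0  -> the point lies on one side
f(c(0, -1))   # = -4 < 0  -> the point lies on the other side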

Separating Hyperplane

Let x̃ = (x̃1, x̃2, . . . , x̃p) be a point in p-dimensional space that does not satisfy (1); rather, the left-hand side of (1) is > 0 at x̃. This tells us that the point x̃ lies to one side of the hyperplane.

Similarly, if it is < 0, then x̃ will be on the other side of the hyperplane.

Thus, one can think of the hyperplane as separating the p-dimensional space into two halves.

For separable training data one can easily construct such a hyperplane and thereafter use it to classify test and future data

Suppose we label yi = 1 when f(xi) = β0 + β1xi1 + β2xi2 + · · · + βpxip > 0, and yi = −1 when β0 + β1xi1 + β2xi2 + · · · + βpxip < 0, for the i = 1, 2, . . . , n subjects in the training data.

Separating Hyperplanes

Combining these, we find the defining property of a separating hyperplane: yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) > 0 for every i = 1, 2, . . . , n

So, the simplest classifier is based on f(x): for a test observation x, if f(x) > 0 we assign it to class +1, and if f(x) < 0 it is assigned to class −1

The magnitude of f(x) is also important. A large value indicates that the point is far from the hyperplane, and hence the user can be confident about the classification. Similarly, a value of f(x) close to zero means that x is located near the hyperplane, and one may be less certain about the class assignment.

Now, a separating hyperplane can be shifted a little upward or downward, or rotated, to create another separating hyperplane without coming into contact with any observations.

Separating Hyperplanes

[Figure: the hair length vs. height scatterplot again ("Height vs Hair Color"), used to illustrate that many separating hyperplanes are possible for the separable data]

The maximal margin classifier

Since there are infinitely many possible separating hyperplanes, we need a reasonable way to decide which of them to use

This necessity leads to the maximal margin classifier: the separating hyperplane that is farthest from the training data points

For a given hyperplane, one can compute its distance from each of the training observations. The smallest of these distances is called the margin; it is the minimum distance of the hyperplane from the observations.

The maximal margin hyperplane is the one that has the largest margin; alternatively, the maximal margin hyperplane has the farthest minimum distance to the training data points

We can then classify a test observation based on which side of the maximal margin hyperplane it lies

The coefficients β = (β0, β1, . . . , βp) are obtained by solving an optimization problem
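To make this concrete, here is a minimal illustrative R sketch (simulated data, not the lecture's example) using the e1071 package: with a linear kernel and a very large cost value, svm() leaves essentially no room for margin violations, so the fit approximates the maximal margin classifier on separable data.

# A minimal sketch: approximating the maximal margin classifier with e1071::svm
library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 3          # shift one class so the data are separable
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
fit$index                                # indices of the support vectors
plot(fit, dat)                           # decision boundary with support vectors marked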

A separable data set and the maximal margin classifier

Figure: Margin and maximal margin hyperplane for a separable data set, plotted in the (X1, X2) plane


The maximal margin classifier: mathematical formulation

Mathematically speaking

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} \; M \qquad (2)$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1,$$

$$y_i\,(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \text{for every } i = 1, 2, \ldots, n$$
The constraint yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) ≥ M for all i = 1, 2, . . . , n, together with M > 0, ensures that each training data point is on the correct side of the hyperplane

M is called the margin of the hyperplane; it is the quantity maximized to obtain β
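A brief aside on why the normalization Σβj² = 1 matters: the perpendicular distance from a point xi to the hyperplane is

$$\operatorname{dist}(x_i) = \frac{\left|\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right|}{\sqrt{\sum_{j=1}^{p} \beta_j^2}},$$

so under the constraint Σβj² = 1 the quantity yi(β0 + β1xi1 + · · · + βpxip) is exactly the signed distance of xi from the hyperplane, and the constraint ≥ M requires every training point to lie at least M away from the hyperplane, on its correct side.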

The pros and cons of MMH

In the previous figure, there are three training points that are equidistant from the hyperplane and lie along the margin. The points on the margin are called support vectors, since they support the maximal margin hyperplane. A small change in the support vectors may dramatically change the maximal margin hyperplane

In other words, the maximal margin hyperplane depends heavily on the support vectors, but not on the movement of the other observations as long as they stay outside the boundary set by the margin

Exact separation of the training data may lead to over-fitting when the margin is small, and can create a sensitivity issue. Additionally, a tiny margin is problematic in the sense that the distance of an observation from the hyperplane can be thought of as a measure of confidence that the observation was correctly classified

Finally, for non-separable data the maximal margin hyperplane does not exist

Non-separable data

For non-separable data, the above-mentioned optimization problem does not have any solution with M > 0

Here one may use a hyperplane with slight flexibility: instead of exactly separating all the training observations, we now almost separate them, using a so-called soft margin

The extension of the maximal margin classifier to non-separable data is known as the support vector classifier

It is also called the soft margin classifier, since it allows some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane

In fact, this is inevitable when there is no separating hyperplane
A non-separable data set

Figure: A non-separable data set, plotted in the (X1, X2) plane, for which a maximal margin hyperplane cannot be constructed
Support vector classifier

As mentioned before, the previous optimization problem does not have any solution for non-separable data

Here, we have a slightly different mathematical formulation

$$\max_{\beta_0, \beta_1, \ldots, \beta_p,\, \epsilon_1, \epsilon_2, \ldots, \epsilon_n} \; M \qquad (3)$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1,$$

$$y_i\,(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i),$$

$$\sum_{i=1}^{n} \epsilon_i \le C,$$

where the εi ≥ 0 are slack variables and C is a non-negative tuning parameter, for all i = 1, 2, . . . , n
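A minimal R sketch (simulated data, illustrative values only) of fitting a support vector classifier with e1071. One caveat about the software, stated as an assumption to check: svm()'s cost argument penalizes margin violations, so it is inversely related to the budget C in (3); a small cost corresponds to a large budget (wide margin, many violations), and vice versa.

library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1           # classes overlap: non-separable data
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
summary(fit)                              # reports the number of support vectors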

Support vector classifier

As before, M is the margin of the hyperplane, which is maximized to obtain β

For the maximal margin hyperplane, the constraint yi(β0 + β1xi1 + β2xi2 + · · · + βpxip) ≥ M for all i = 1, 2, ..., n, with M > 0, ensures that each training data point is on the correct side of the hyperplane

Here, the relaxed constraint yi f(xi) ≥ M(1 − εi) allows some of the training observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane

εi = 0 indicates that the ith observation is on the correct side of the margin; εi > 0 indicates that the ith observation has violated the margin; and finally εi > 1 means it is on the wrong side of the hyperplane

C is the tuning parameter, playing the role of a bias-variance trade-off

Support vector classifier

The tuning parameter plays a crucial role in determining the number and severity of margin (and/or hyperplane) violations

C can be thought of as a budget for the amount by which the margin can be violated by the n observations

What is the meaning of C = 0 and of C > 0?

It also ensures that no more than C observations can be on the wrong side of the hyperplane, since each such observation has εi > 1 and ε1 + ε2 + · · · + εn ≤ C

So, a high value of C indicates that more violations of the margin are tolerated, which will widen the resulting margin

Following is an example in which the SVC was applied to a small non-separable data set

A support vector classifier

Figure: Support vector classifier, with margin and hyperplane, fitted to twelve labelled observations in the (X1, X2) plane for two different choices of C (two panels)
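A comparison like the one in this figure can be reproduced (on simulated data, since the slide's data set is not given) by refitting the classifier at two cost values and using e1071's plot method for svm objects.

library(e1071)

set.seed(10)
x <- matrix(rnorm(24), ncol = 2)                 # twelve observations, two features
y <- c(rep(-1, 6), rep(1, 6))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

fit_loose <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
fit_tight <- svm(y ~ ., data = dat, kernel = "linear", cost = 100, scale = FALSE)
plot(fit_loose, dat)   # small cost: many support vectors, wide effective margin
plot(fit_tight, dat)   # large cost: fewer support vectors, narrow effective margin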

Support vector classifier on the same data for different choices of the tuning parameter
Figure: Support vector classifier fitted to the same data for different values of C (multiple panels in the (X1, X2) plane)

Support vector classifier

A small value of C leads to a narrower margin with few margin violations

Hence, the selection of C involves a bias-variance trade-off: what does this mean?

In practice, the tuning parameter C is chosen via cross-validation (a sketch in R follows at the end of this slide)

It can be shown that only the observations that either lie on the margin or violate the margin have an impact on the hyperplane, and hence on the SVC

Changing the position of the other observations has no impact on the classifier, as long as they remain on the correct side of the margin

Here, the observations (vectors) that lie on the margin or violate the margin are known as support vectors

We see that the decision boundary resulting from the SVC depends only on a subset of the training data, which means it is robust to the behavior of observations that are far from the margin. This is a distinctive feature of this classifier
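As noted above, the tuning parameter is chosen by cross-validation in practice; a minimal sketch using e1071's tune() on simulated data (the candidate cost values are illustrative assumptions):

library(e1071)

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

cv <- tune(svm, y ~ ., data = dat, kernel = "linear",
           ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(cv)                 # 10-fold cross-validation error for each cost value
best <- cv$best.model       # the classifier refitted at the best cost
summary(best)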

Non-linear decision boundary: Support vector machine

The SVC is a natural approach for classification in the two-class setting when the boundary between the two classes is approximately linear

This is not always the case in real life, where non-linear decision boundaries are not rare

Recall the similar discussion in multiple regression, where one or more predictors may influence the continuous response in a non-linear fashion

Following is a visual example in which a non-linear decision boundary is inevitable

A support vector classifier on data having a non-linear decision boundary
Figure: Two-class data with a non-linear decision boundary, and a support vector classifier fitted to them (two panels in the (X1, X2) plane)
Support vector machine
To address this non-linearity of the data, we need to incorporate some non-linear functions of the features in the optimization

With quadratic terms, the optimization becomes

$$\max_{\beta_0,\, \beta_{11}, \ldots, \beta_{p1},\, \beta_{12}, \ldots, \beta_{p2},\, \epsilon_1, \ldots, \epsilon_n} \; M \qquad (4)$$

subject to

$$\sum_{j=1}^{p} \sum_{k=1}^{2} \beta_{jk}^2 = 1,$$

$$y_i \Bigl(\beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2 \Bigr) \ge M(1 - \epsilon_i),$$

$$\sum_{i=1}^{n} \epsilon_i \le C,$$

where the εi ≥ 0 are slack variables and C is a non-negative tuning parameter, for all i = 1, 2, . . . , n
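One way to see what formulation (4) is doing, as a hedged sketch on simulated data rather than the lecture's example: explicitly add squared features and fit an ordinary linear support vector classifier on the enlarged feature set. The boundary is linear in (x1, x2, x1², x2²) but non-linear in the original (x1, x2).

library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- as.factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular class boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  x1sq = x[, 1]^2, x2sq = x[, 2]^2, y = y)

# Linear SVC on the enlarged feature set (x1, x2, x1^2, x2^2)
fit <- svm(y ~ x1 + x2 + x1sq + x2sq, data = dat, kernel = "linear", cost = 1)
mean(predict(fit, dat) == dat$y)    # training accuracy of the enlarged-feature fit

The kernel approach on the next slide achieves the same kind of enlargement implicitly, and far more efficiently.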
Support vector machine

The quadratic terms involved in the optimization result in a non-linear decision boundary, since the solution set of a quadratic equation is, in general, non-linear

One may certainly also involve higher-order polynomials, as well as two-factor interactions of the form Xj Xk with j ≠ k

Several other functions of the predictors are also allowed

The SVM is an extension of the SVC that results from enlarging the feature space in a specific way, using kernels K(xi, xi′)

Different functional forms of the kernel yield different types of SVM; e.g.,

$$K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij}\, x_{i'j}\Bigr)^{d}$$

is called the polynomial kernel with degree d; here d ≥ 1 is an integer

Another popular choice is the radial kernel, given by

$$K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr) \qquad (5)$$
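A hedged R sketch of the two kernels above, again on simulated data: e1071's svm() exposes them through kernel = "polynomial" (with degree) and kernel = "radial" (with gamma); the specific parameter values here are illustrative.

library(e1071)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- as.factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # non-linear boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit_poly   <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 3, cost = 1)
fit_radial <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)

plot(fit_poly, dat)     # non-linear decision boundary from the polynomial kernel
plot(fit_radial, dat)   # non-linear decision boundary from the radial kernel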

Support vector machine with a polynomial kernel of degree d = 3 and with a radial kernel
Figure: Support vector machine with a polynomial kernel (left panel) and a radial kernel (right panel), plotted in the (X1, X2) plane
