cs276a SVM Review

SVM Friend or Foe?

We will go through what SVMs are and why they are good. SVMs are linear classifiers (a line in 2 dimensions, a plane in 3 dimensions, an (n-1)-dimensional hyperplane in n dimensions), and they are good for 3 good reasons and 1 very good reason.

Reason 1
Consider the following labeled training data and a point that we want to classify.

We classify the unknown point by fitting a classifier to the labeled training data (i.e. training) and then having it make the classification decision. Consider the following two classifiers:

This one is linear: everything on the left is classified as class +, everything on the right as class -. It assigns class + to the point we are trying to classify.

This one essentially memorized class +. Everything in the immediate neighborhood of the + points is considered to be in class +, everything outside is class -. It assigns class - to the point we are classifying.

Which one did better is a philosophical question. We will say that the linear one did a better job; you may agree with that statement, or you may disagree. There is no right answer here, but in the average case it appears that the linear classifier captured more from the data: we have higher confidence that if it sees more points from the same distribution it will label more of them correctly, that is, that it will generalize better. This is essentially Occam's razor: all things being equal, we prefer the simpler hypothesis, sometimes even if the simpler hypothesis does not correctly decide our training data (notice that the linear classifier misclassifies one of the + points). A bit more formally, the linear classifier is simpler since it needs fewer parameters. Any line can be specified by two parameters m and b (as in y = mx + b), so the linear classifier is entirely determined by two parameters. The second classifier memorized 5 points, each with 2 coordinates, so it needs at least 10 parameters (see footnote 1). We say that the linear classifier has higher bias (= lower variance = lower capacity) than the second classifier, or equivalently that the second classifier has higher variance (= higher capacity = lower bias).

Take the above with a grain of salt: it is as true as (and suffers from the same problems as) Occam's razor. Nevertheless, in machine learning practice over-memorizing your data is a very common problem (called overtraining); a model without enough capacity does not happen nearly as much, so we'll go with it and say that:

LINEAR CLASSIFIERS ARE GOOD

Reason 2
Let us look at the following 2 dimensional dataset:

   x     y    class
   0     0      -
   1     1      +
  -1     1      +
   1    -1      +
  -1    -1      +

Footnote 1: To be honest, due to the author's laziness, the circles as drawn have different sizes, so the model was also able to specify a radius for each of the five circles: another 5 parameters, for a total of 10 + 5 = 15. To be even more honest, due to even more laziness, some of the circles are really ellipses, so the model was able to fit 5 ellipses. The equation for an ellipse is

    \frac{(x + r)^2}{a} + \frac{(y + q)^2}{b} = 1,

so there are 4 parameters for each of the 5 ellipses; thus the model requires at least 4 * 5 = 20 parameters.



We have 5 points: 4 in class + and 1 in class -. This dataset is not linearly separable: we'll never be able to find a line (also called a linear decision surface, or a linear classifier) such that all the + are on one side and all the - are on the other. To try to make the data linearly separable we can compute more features from the data; for example, we can add the feature x^2 to every data point. Our dataset represented in this higher dimensional feature space is then:

   x     y    x^2   class
   0     0     0      -
   1     1     1      +
  -1     1     1      +
   1    -1     1      +
  -1    -1     1      +

We see that
1) if x^2 = 0, the point is in class -
2) if x^2 = 1, the point is in class +

We have made the decision problem much simpler; let us consider what this looks like geometrically. Here is the surface f(x,y) = x^2

Let us plot our points in feature space (x, y, x^2); they lie on this surface


The + points lie above the - point, so the data is now linearly separable (a linear classifier in 3 dimensions is a plane).
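To make the table concrete, here is a minimal Python sketch (not part of the original notes) that adds the x^2 feature to the five points and checks that a horizontal plane separates them. The plane height 0.5 and the function names are assumptions chosen for illustration; any height strictly between 0 and 1 would separate this particular dataset.

```python
# The five points from the text as (x, y, label); +1 stands for class +, -1 for class -.
data = [(0, 0, -1), (1, 1, +1), (-1, 1, +1), (1, -1, +1), (-1, -1, +1)]


def add_square_feature(x, y):
    """Map a 2d point (x, y) to the feature-space point (x, y, x^2)."""
    return (x, y, x * x)


def classify_by_plane(point3d, height=0.5):
    """Classify with the horizontal plane z = height (an assumed choice).

    Points above the plane are class +, points below are class -.
    """
    return +1 if point3d[2] > height else -1


predictions = [classify_by_plane(add_square_feature(x, y)) for x, y, _ in data]
labels = [label for _, _, label in data]
print(predictions == labels)  # True: the plane separates the data in feature space
```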



We can draw a plane such that the + points lie above it and the - point lies below. Our classifier (the plane) intersects the surface in two lines; we call this our decision boundary. Class - is all points on the surface between these lines, class + is all the other points on the surface. What does this decision boundary look like in our original 2d space? In feature space the points are (x, y, x^2); to go back we drop the z-coordinate, which geometrically means looking straight down. Doing that, we see that our decision boundary becomes two lines, with the - point between them.

[Figure: the decision boundary drawn in the original 2d space.]

Let us do another example: we will pick a different 3rd feature and see what happens. We select x^2 + y^2 as our new feature. Proceeding as before, we compute our new data.

   x     y    x^2 + y^2   class
   0     0        0         -
   1     1        2         +
  -1     1        2         +
   1    -1        2         +
  -1    -1        2         +

Again we notice
1) if x^2 + y^2 = 0, then class -
2) if x^2 + y^2 = 2, then class +

So we expect the data to have become linearly separable when considered in this higher dimensional feature space.

The surface f(x,y) = x^2 + y^2



Here are our points in feature space (x, y, x^2 + y^2)


They are indeed linearly separable, though our original data is not.

We find a plane (= linear classifier) such that class + is above and class - is below. The intersection of the plane with the bowl-shaped surface is a circle; everything above that circle on the bowl is classified class +, everything below class -. Let us again consider what this decision boundary is in our original 2d space. As before, we drop the z-coordinate by looking straight down, and so the decision boundary back in our original space is a circle.
[Figure: the circular decision boundary, shown on the bowl-shaped surface and in the original 2d space.]
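The same construction can be sketched in Python for the x^2 + y^2 feature. The plane height 1.0 is an assumption (it lies midway between the feature values 0 and 2 seen in the table), and the two extra query points are hypothetical. The sketch classifies once in feature space with the plane and once in the original space with the resulting circle, and the two agree.

```python
import math

PLANE_HEIGHT = 1.0  # assumed: midway between feature value 0 (class -) and 2 (class +)


def lift(x, y):
    """Map (x, y) to the feature-space point (x, y, x^2 + y^2)."""
    return (x, y, x * x + y * y)


def classify_in_feature_space(x, y):
    """Above the plane z = PLANE_HEIGHT is class +, below is class -."""
    return "+" if lift(x, y)[2] > PLANE_HEIGHT else "-"


# Looking straight down, the plane/bowl intersection is a circle of this radius.
boundary_radius = math.sqrt(PLANE_HEIGHT)


def classify_in_original_space(x, y):
    """Outside the circle is class +, inside is class -."""
    return "+" if math.hypot(x, y) > boundary_radius else "-"


training = [(0, 0, "-"), (1, 1, "+"), (-1, 1, "+"), (1, -1, "+"), (-1, -1, "+")]
queries = [(0.5, 0.5), (2.0, 0.0)]  # hypothetical new points

for x, y, label in training:
    print((x, y), label, classify_in_feature_space(x, y), classify_in_original_space(x, y))
for x, y in queries:
    print((x, y), classify_in_feature_space(x, y), classify_in_original_space(x, y))
```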

Thus we have the second reason why SVMs are terrific:

A HIGHER DIMENSIONAL FEATURE SPACE IS GOOD

because things that were not linearly separable before become linearly separable. Note that we never know which features will make our data linearly separable; x^2 and x^2 + y^2 were lucky guesses. But if we pick many non-linear features (like xy, sin(x), log(x^2 y), etc.), then there is a good chance our data will become linearly separable, even though there is no certainty.



A caveat: an SVM does not just choose any plane that separates our data; it picks the one that maximizes the geometric margin between the data and the plane. In the example above, we could have picked a plane just a little bit below the + points. It would still separate the data, but we are better off picking the plane in the middle between the two classes. The task of finding this unique plane is an optimization problem: out of all possible planes we have to find the one that maximizes a certain objective (the geometric margin). It turns out that this can be cast as a quadratic optimization problem, and we will not worry about it too much except to say that it can be done.
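As a rough illustration of why the middle plane is preferred, the sketch below computes the geometric margin (the smallest signed distance from any training point to the plane) for two candidate planes in the (x, y, x^2 + y^2) feature space. It only compares two hand-picked planes, z = 1 and z = 1.9; it does not perform the quadratic optimization. Both plane choices and the helper name are assumptions for illustration.

```python
import math

# Feature-space points (x, y, x^2 + y^2) with labels +1 / -1, from the example above.
points = [
    ((0, 0, 0), -1),
    ((1, 1, 2), +1),
    ((-1, 1, 2), +1),
    ((1, -1, 2), +1),
    ((-1, -1, 2), +1),
]


def geometric_margin(w, b, labeled_points):
    """Smallest signed distance from a point to the plane w.p + b = 0.

    The distance is positive when the point is on the side its label asks for,
    so a positive result means the plane separates the data.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * pi for wi, pi in zip(w, p)) + b) / norm
        for p, label in labeled_points
    )


middle = geometric_margin((0, 0, 1), -1.0, points)     # plane z = 1
near_plus = geometric_margin((0, 0, 1), -1.9, points)  # plane z = 1.9, just below the + points

print(middle, near_plus)  # both positive (both separate), but the middle plane's margin is larger
```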

Reason 3
Consider the following datasets together with their optimal separating plane (which in 2 dimensions is a line).

We can see that only the four points that are alone in the leftmost picture matter in determining the separating plane. We call these points support vectors. Let us concentrate on the case where we only have support vectors and let us consider how precisely we do the classification. The optimal linear classifier (in the geometric margin sense) that we see in the first picture on the left is the line y = x. This equation, written as 0 = x - y, means that any point on the line (like (3,3)) will give zero when we subtract y from x. However, the function f(x,y) = x - y can give us more than that: for any point (x,y), if f(x,y) < 0 then the point lies to the left of the plane; if f(x,y) > 0 then the point lies to the right of the plane, so f(x,y) is the classifier. For any point p = (x,y) that we wish to classify, we decide p is in
1) class + if f(x,y) > 0
2) class - if f(x,y) < 0

We've seen that our classifier is completely determined by the support vectors, so it must be the case that the expression x - y is completely determined by our support vectors, but how? For each support vector we have three pieces of information: its two coordinates and its classification. This has to be enough to get x - y. We have 4 support vectors: (-1,0) and (0,1) in class -; (0,-1) and (1,0) in class +. Let us use +1 for class + and -1 for class -. Then, just summing together all the information from our support vectors, we have

    -(-1,0) - (0,1) + (0,-1) + (1,0) = -1*(-1,0) - 1*(0,1) + 1*(0,-1) + 1*(1,0) = (2,-2)



and now it's obvious that if we just dot product (x,y) with the result we get (x,y)·(2,-2) = 2x - 2y. We are off by a constant, but the magic quadratic optimization procedure that finds our optimal plane gives us a constant for each support vector, which in this case we see is 1/2, and we have our classifier, which in this particular example is

    f(x,y) = (x,y) \cdot [ -\frac{1}{2}(-1,0) - \frac{1}{2}(0,1) + \frac{1}{2}(0,-1) + \frac{1}{2}(1,0) ] = x - y

To recap,
1) We start with our dataset
2) It uniquely determines a plane that best separates the data
3) We feed the data through a quadratic optimization procedure which finds this plane
4) The procedure tells us what this plane is by giving us a constant for every data point
   a. This constant is zero if the point is not a support vector (it must be zero, since as we saw before any point that is not a support vector can have no impact on the plane)
   b. It is non-zero for every support vector (in the example above this constant was 1/2 for every support vector; in general it is going to be a different value for each)
   c. Caveat: if our separating plane does not pass through the origin (as it did in the example above), the procedure will also give us a way to compute an intercept term as a function of the support vectors, but let us not worry too much about it here
5) To classify a point p
   a. If you are a person
      i. you look at which side of the separating plane p is on
   b. If you are a computer
      i. you dot product p with every (support vector * its constant from the optimization procedure * its class value (+1 or -1)), and sum everything together. If the result is < 0 then you decide p is in class -, if > 0 p is in class + (if the result is 0, then p lies on the separating plane and you are screwed)

We have reason three:

THE CLASSIFICATION DECISION FOR A POINT P CAN BE EXPRESSED AS A FUNCTION OF DOT PRODUCTS OF P WITH THE SUPPORT VECTORS

If this reason seems a bit dubious, don't worry, it will come into its own in the next section.
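Step 5b can be sketched in a few lines of Python using the four support vectors and the constant 1/2 from the example. The function and variable names are illustrative assumptions; the numbers are the ones given in the text, and the separating plane passes through the origin, so no intercept term is needed here.

```python
# (support vector, class value, constant from the optimization procedure)
support_vectors = [
    ((-1, 0), -1, 0.5),
    ((0, 1), -1, 0.5),
    ((0, -1), +1, 0.5),
    ((1, 0), +1, 0.5),
]


def dot(u, v):
    """Ordinary dot product: multiply component by component and sum up."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(p):
    """Sum of constant * class value * (support vector . p) over all support vectors."""
    return sum(c * label * dot(sv, p) for sv, label, c in support_vectors)


def classify(p):
    value = decision_value(p)
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "on the plane"


print(decision_value((3, 1)))  # 2.0, which equals x - y for the point (3, 1)
print(classify((3, 3)))        # on the plane, since (3, 3) lies on the line y = x
```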

Reason 4
As we saw above, classification of a new vector is mainly decided by taking dot products between our support vectors and the new vector we are classifying. Let us take a closer look at the dot product operation. Suppose our data is one dimensional and we have two vectors x and z; let us expand to a two dimensional feature space by adding a feature 3x. Then we have
  original data    expand    data in feature space
  x                  ->      (x, 3x)
  z                  ->      (z, 3z)

Let us take a dot product in feature space. We do it in the usual manner: multiply component by component and sum up,
    (x, 3x) \cdot (z, 3z) = xz + 3x \cdot 3z = xz + 9xz = 10xz
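A quick numeric check of this identity, as a small Python sketch. The sample values x = 2 and z = 5 are arbitrary, chosen only for illustration.

```python
def expand(x):
    """Map one dimensional data x to the feature-space vector (x, 3x)."""
    return (x, 3 * x)


def dot(u, v):
    """Multiply component by component and sum up."""
    return sum(a * b for a, b in zip(u, v))


x, z = 2.0, 5.0  # arbitrary sample values

hard_way = dot(expand(x), expand(z))  # expand both, then take the dot product
easy_way = 10 * x * z                 # the shortcut computed on the original data

print(hard_way, easy_way)  # 100.0 100.0
```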

Now notice something really important: if we want to compute a dot product between two vectors in feature space, we can either expand each one and do the dot product in the usual manner, or we can just compute 10xz, which as we saw above is the same thing. This does not seem important only because the feature space expansion we've chosen here is trivial. Consider another example; this time our data is two dimensional, and we again have two vectors (x_1, x_2) and (z_1, z_2). We choose the expansion (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2). This is a very powerful set of features; remember that in the examples before simply choosing x^2 as a feature made the data linearly separable. This particular expansion is more powerful, so we expect even more datasets that were not linearly separable in 2 dimensions to be linearly separable in this feature space. We have
  original data    expand    data in feature space
  (x_1, x_2)         ->      (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2)
  (z_1, z_2)         ->      (1, z_1^2, \sqrt{2} z_1 z_2, z_2^2, \sqrt{2} z_1, \sqrt{2} z_2)

Consider the dot product between two vectors in the feature space:
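    (1, x_1^2, \sqrt{2} x_1 x_2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2) \cdot (1, z_1^2, \sqrt{2} z_1 z_2, z_2^2, \sqrt{2} z_1, \sqrt{2} z_2)
        = 1 \cdot 1 + x_1^2 z_1^2 + \sqrt{2} x_1 x_2 \sqrt{2} z_1 z_2 + x_2^2 z_2^2 + \sqrt{2} x_1 \sqrt{2} z_1 + \sqrt{2} x_2 \sqrt{2} z_2
        = 1 + x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 + 2 x_1 z_1 + 2 x_2 z_2
        = (1 + x_1 z_1 + x_2 z_2)^2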

This means we can compute the dot product in this high dimensional feature space either the hard way, by expanding the vectors and doing the regular dot product, or the easy way, by computing (1 + x_1 z_1 + x_2 z_2)^2. Remember that to do the classification in the feature space all we need is to be able to take dot products. This means we can do the classification in this 6 dimensional feature space by simply computing (1 + x_1 z_1 + x_2 z_2)^2 from our original 2 dimensional data.
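The hard way and the easy way can be compared numerically. The sketch below uses arbitrary sample vectors (an assumption for illustration) and compares the results with a floating point tolerance, since the square roots of 2 are not exact in floating point.

```python
import math


def expand(x1, x2):
    """The 6 dimensional expansion (1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2)."""
    r = math.sqrt(2)
    return (1.0, x1 * x1, r * x1 * x2, x2 * x2, r * x1, r * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(x, z):
    """The easy way: (1 + x1 z1 + x2 z2)^2 computed on the original 2d data."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2


x = (1.0, 2.0)   # arbitrary sample vectors
z = (3.0, -1.0)

hard_way = dot(expand(*x), expand(*z))  # 6 multiplications after two expansions
easy_way = kernel(x, z)                 # (1 + 3 - 2)^2 = 4

print(math.isclose(hard_way, easy_way))  # True
```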



Now imagine our feature space is 1,000,000,000,000 dimensional. As long as the dot product in that huge space is the same as some simple function on our low dimensional original data, we can do linear classification in that enormous feature space extremely cheaply. We call

    K((x_1, x_2), (z_1, z_2)) = (1 + x_1 z_1 + x_2 z_2)^2

a kernel function. Every function that corresponds to the dot product in some particular feature space expansion is a kernel function, and given a kernel function we can do linear classification in that feature space. For example,

    K((x_1, x_2), (z_1, z_2)) = (1 + x_1 z_1 + x_2 z_2)^d

for any d > 1 is a kernel function, and is called a polynomial kernel (see footnote 2). Thus we have our final, very good reason why SVMs are good:

WE CAN REPRESENT A DOT PRODUCT IN A HIGH DIMENSIONAL FEATURE SPACE AS A SIMPLE FUNCTION ON OUR ORIGINAL DATA

Which means we can classify in that feature space using this simple kernel function. A recap: SVM classification is a name for the following
1) We take our data
2) We map it to a high dimensional feature space where it will probably become linearly separable
3) We find an optimal separating plane in the feature space
4) We are able to express the classification decision in the feature space in terms of dot products of vectors in the feature space
5) We perform the classification decision by doing the dot products indirectly, using an equivalent kernel function which is much simpler, and much cheaper to compute
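The recap can be sketched end to end on the five-point dataset from Reason 2, using the kernel (1 + x_1 z_1 + x_2 z_2)^2 in place of explicit dot products. Note the assumptions: the constants (1 for the - point, 1/4 for each + point) and the intercept -1 are not given in these notes; they were worked out by hand for this sketch rather than produced by a quadratic optimization routine, and with them the decision value simplifies to x^2 + y^2 - 1, the circle found earlier. Treat the block as an illustration of step 5, not as a full SVM implementation.

```python
def kernel(u, v):
    """Degree 2 polynomial kernel on 2d data: (1 + u1 v1 + u2 v2)^2."""
    return (1 + u[0] * v[0] + u[1] * v[1]) ** 2


# (support vector, class value, constant). The constants are assumed values
# derived by hand for this example, not output of an optimizer.
support_vectors = [
    ((0, 0), -1, 1.0),
    ((1, 1), +1, 0.25),
    ((-1, 1), +1, 0.25),
    ((1, -1), +1, 0.25),
    ((-1, -1), +1, 0.25),
]
INTERCEPT = -1.0  # assumed; needed because the plane does not pass through the origin


def decision_value(p):
    """Kernel version of step 5b: the dot products are replaced by kernel evaluations."""
    return sum(c * label * kernel(sv, p) for sv, label, c in support_vectors) + INTERCEPT


def classify(p):
    value = decision_value(p)
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "on the boundary"


for sv, label, _ in support_vectors:
    print(sv, label, decision_value(sv))  # -1.0 for the - point, 1.0 for each + point
print(classify((2, 0)), classify((0.5, 0.5)))  # + -
```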

Footnote 2: Note that our original data can be n-dimensional; the polynomial kernel function is then

    K((x_1, x_2, \ldots, x_n), (z_1, z_2, \ldots, z_n)) = (1 + x_1 z_1 + x_2 z_2 + \cdots + x_n z_n)^d.
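For completeness, the n-dimensional polynomial kernel is a one-liner in Python. The function name, the default degree, and the sample vectors are assumptions for illustration.

```python
def polynomial_kernel(x, z, d=2):
    """Polynomial kernel (1 + x1 z1 + ... + xn zn)^d for two vectors of equal length."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same number of dimensions")
    return (1 + sum(xi * zi for xi, zi in zip(x, z))) ** d


value_2d = polynomial_kernel((1.0, 2.0), (3.0, -1.0))     # (1 + 3 - 2)^2
value_3d = polynomial_kernel((1, 0, 2), (2, 5, 1), d=3)   # (1 + 2 + 0 + 2)^3

print(value_2d, value_3d)  # 4.0 125
```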


