Linear Classifiers
The Perceptron
Logistic Regression
Decision Theory
Suppose we wish to make measurements on a medical image and classify it as showing evidence of cancer or not:

[Figure: processing pipeline — image → image processing → measurement x → decision rule → class C1 (cancer) or C2 (no cancer).]

The measurement x and class C_i have the joint probability

$$p(x, C_i) = p(x \mid C_i)\, p(C_i)$$

How do we make the best decision?
Classification
Assign an input vector x to one of two or more classes C_k.
Any decision rule divides the input space into decision regions separated by decision boundaries.
We would also like a confidence measure: how sure are we that the input belongs to the chosen category?
Decision boundary for average error
Consider a two-class decision depending on a scalar variable x.

[Figure: joint densities p(x, C1) and p(x, C2) over x with a decision boundary x̂; region R1 is assigned to C1 and region R2 to C2.]

$$p(\text{error}) = \int p(\text{error}, x)\, dx = \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx$$
Bayes error
A classifier is a mapping from a vector x to class labels {C1, C2}.

$$\begin{aligned}
p(\text{error}) &= \int p(\text{error}, x)\, dx \\
&= \int_{R_1} p(x, C_2)\, dx + \int_{R_2} p(x, C_1)\, dx \\
&= \int_{R_1} p(C_2 \mid x)\, p(x)\, dx + \int_{R_2} p(C_1 \mid x)\, p(x)\, dx
\end{aligned}$$

The minimum of this error over all possible choices of decision regions is the Bayes error.
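To make the integral concrete, here is a minimal NumPy sketch (not from the original notes) that approximates this error for two assumed 1D Gaussian class densities with equal priors:

```python
import numpy as np

def gauss(x, mu, sigma):
    """Gaussian pdf N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative densities and equal priors (assumed values, not from the lecture).
x = np.linspace(-6.0, 8.0, 10001)
joint1 = 0.5 * gauss(x, 0.0, 1.0)   # p(x, C1) = p(x|C1) p(C1)
joint2 = 0.5 * gauss(x, 2.0, 1.0)   # p(x, C2) = p(x|C2) p(C2)

# At each x the best decision picks the larger joint density, so the
# error contribution is the smaller one; integrate over x.
dx = x[1] - x[0]
bayes_error = np.sum(np.minimum(joint1, joint2)) * dx
print(f"approximate Bayes error: {bayes_error:.4f}")   # ~0.1587 for these densities
```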
[Figure: left, class-conditional densities p(x|C1) and p(x|C2); right, posterior probabilities p(C1|x) and p(C2|x), which sum to 1 at every x.]

$$p(C_1 \mid x) + p(C_2 \mid x) = 1, \quad \text{so} \quad p(C_2 \mid x) = 1 - p(C_1 \mid x)$$

i.e. choose class $C_i$ if $p(C_i \mid x) > 0.5$.
Reject option
Avoid making decisions if unsure.

[Figure: posteriors p(C1|x) and p(C2|x) over x, with a central reject region where neither posterior is large enough to decide confidently.]
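A minimal sketch of the reject rule (my illustration, not from the notes), assuming the posteriors have already been computed and using an illustrative threshold theta:

```python
import numpy as np

def decide_with_reject(posteriors, theta=0.9):
    """Return the index of the class with the largest posterior, or -1
    (reject) if that posterior does not exceed the threshold theta.
    `posteriors` is an array of p(C_k | x) values summing to 1."""
    posteriors = np.asarray(posteriors)
    k = int(np.argmax(posteriors))
    return k if posteriors[k] > theta else -1

print(decide_with_reject([0.95, 0.05]))  # 0: confident, classify as C1
print(decide_with_reject([0.60, 0.40]))  # -1: inside the reject region
```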
Example: skin detection in chromaticity space

[Figure: skin pixels plotted in chromaticity space, with axes r = R/(R+G+B) and g = G/(R+G+B).]
[Figure: class-conditional histograms p(x|background) and p(x|skin) over the chromaticity coordinates r = R/(R+G+B) and g = G/(R+G+B).]
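A hedged sketch of how such histograms could be turned into a posterior via Bayes' rule; the histogram arrays and the prior p(skin) below are placeholder assumptions, not values from the lecture:

```python
import numpy as np

# Hypothetical class-conditional histograms over (r, g) chromaticity bins,
# e.g. 32x32 counts accumulated from labelled skin / background pixels.
skin_hist = np.random.rand(32, 32)   # placeholder for real skin counts
bg_hist = np.random.rand(32, 32)     # placeholder for real background counts

# Normalise counts into densities p(x|skin), p(x|background).
p_x_skin = skin_hist / skin_hist.sum()
p_x_bg = bg_hist / bg_hist.sum()

p_skin = 0.3   # assumed prior probability that a pixel is skin

# Bayes' rule per bin: p(skin|x) = p(x|skin) p(skin) / p(x).
evidence = p_x_skin * p_skin + p_x_bg * (1.0 - p_skin)
p_skin_x = p_x_skin * p_skin / np.maximum(evidence, 1e-12)

# Decision rule from the lecture: classify a bin as skin if p(skin|x) > 0.5.
skin_mask = p_skin_x > 0.5
```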
[Figure: an input frame, the likelihood P(x|skin), and the posterior P(skin|x) evaluated per pixel.]
[Figure: test data — p(x|background), p(x|skin), p(skin|x), and the decision p(skin|x) > 0.5.]

Testing on other frames gives worse performance.
A decision can be a false positive or a false negative; the two other cases are true positive and true negative.
Risk

$$R(C_i \mid x) = \sum_j L_{ij}\, p(C_j \mid x)$$

Loss matrix L_ij (rows: classification; columns: truth):

                      truth: cancer   truth: normal
  classify as cancer        0               1
  classify as normal     1000               0

More generally, for a measurement x and actions a_i,

$$R(a_i \mid x) = \sum_j L(a_i \mid C_j)\, p(C_j \mid x)$$

where $L(a_i \mid C_j)$ is the loss incurred if action i is taken and the true state is j.

Bayes decision rule: select the action for which R(a_i | x) is minimum.
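A minimal sketch applying this rule with the loss matrix above (the posterior values are illustrative):

```python
import numpy as np

# Loss matrix from the lecture: rows = action (classify as cancer / normal),
# columns = true state (cancer, normal).
L = np.array([[0.0,    1.0],     # classify as cancer
              [1000.0, 0.0]])    # classify as normal

def bayes_action(posterior):
    """posterior = [p(cancer|x), p(normal|x)]; return the action with
    minimum conditional risk R(a_i|x) = sum_j L[i, j] p(C_j|x)."""
    risks = L @ np.asarray(posterior)
    return int(np.argmin(risks)), risks

# Even with only a 1% posterior probability of cancer, the huge loss for a
# missed cancer makes "cancer" the minimum-risk action.
action, risks = bayes_action([0.01, 0.99])
print(action, risks)   # 0 (cancer), risks = [0.99, 10.0]
```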
Likelihood ratio

For two-category classification with a loss function, minimizing the conditional risk reduces to a threshold test on the likelihood ratio p(x|C1)/p(x|C2).

The set g(x) = 0 is a discriminant surface separating the region assigned to C1 (g > 0) from the region assigned to C2 (g < 0); in 2D, g(x) = 0 is a set of curves. Then

$$g(x) = \ln \frac{p(x \mid C_1)}{p(x \mid C_2)} + \ln \frac{p(C_1)}{p(C_2)} = \ln p(x \mid C_1) - \ln p(x \mid C_2) + \ln \frac{p(C_1)}{p(C_2)}$$

If the class-conditional densities are Gaussian, $p(x \mid C_i) = \mathcal{N}(x;\, \mu_i, \Sigma_i)$, this becomes

$$g(x) = -\tfrac{1}{2}(x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) + \tfrac{1}{2}(x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2) + c_0$$

where $c_0 = \ln \frac{p(C_1)}{p(C_2)} - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}\ln|\Sigma_2|$.
Case 1: $\Sigma_i = \sigma^2 I$

Here g(x) is a linear function of x, so the decision boundary is a hyperplane.

Example in 2D:

$$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \Sigma_i = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

With equal priors the boundary is the perpendicular bisector of the two means, the line $x_1 = 1/2$.

Case 3: $\Sigma_i$ arbitrary

The discriminant surface is a hyperquadric (in 2D, a conic section).
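A small NumPy sketch of the Gaussian discriminant above, evaluated on this 2D example (equal priors assumed); with these values the decision boundary is the line x1 = 1/2:

```python
import numpy as np

def gaussian_discriminant(x, mu1, mu2, S1, S2, prior1=0.5, prior2=0.5):
    """g(x) = ln p(x|C1) - ln p(x|C2) + ln(p(C1)/p(C2)) for Gaussian class
    densities; decide C1 if g(x) > 0."""
    d1, d2 = x - mu1, x - mu2
    return (-0.5 * d1 @ np.linalg.inv(S1) @ d1
            + 0.5 * d2 @ np.linalg.inv(S2) @ d2
            + np.log(prior1 / prior2)
            - 0.5 * np.log(np.linalg.det(S1))
            + 0.5 * np.log(np.linalg.det(S2)))

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
S = np.eye(2)   # Case 1 with sigma = 1
for x in [np.array([0.2, 0.7]), np.array([0.8, -0.3])]:
    g = gaussian_discriminant(x, mu1, mu2, S, S)
    print(x, "-> C1" if g > 0 else "-> C2")   # boundary is x1 = 0.5 here
```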
Linear classifiers

$$g(x) = w^\top x + w_0$$

[Figure: point sets in the (X1, X2) plane that are linearly separable and not linearly separable.]

A linear classifier

$$g(x_i) = w^\top x_i + w_0$$

separates the categories for i = 1, ..., n.

[Figure: the perceptron update moves w towards a misclassified example x_i.]

The perceptron learning rule cycles through the data and adds each misclassified example (suitably signed and scaled) to w. NB: after convergence, w is a linear combination of the training examples:

$$w = \sum_i \alpha_i x_i$$
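A minimal perceptron training sketch under common assumptions (labels y_i in {-1, +1}, bias absorbed by appending a 1; the learning rate and toy data are invented for illustration):

```python
import numpy as np

def perceptron(X, y, epochs=100, eta=1.0):
    """Train a perceptron on X (n x d) with labels y in {-1, +1}.
    The bias w0 is absorbed by appending a constant 1 to each x_i."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:     # misclassified (or on the boundary)
                w += eta * yi * xi     # add the example to the weights
                errors += 1
        if errors == 0:                # converged: all points separated
            break
    return w

# Toy linearly separable data (illustrative, not from the lecture).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # matches y
```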
[Figure: a perceptron solution on 2D example data, and a wider-margin classifier on the same data.]
Logistic Regression

Ideally we would like to fit a discriminant function using regression methods similar to those developed for ML and MAP parameter estimation, but there is no equivalent of model + noise here, since we wish to map all the spread-out features in the same class to one label. We therefore need a function that maps $(-\infty, +\infty)$ to $(0, 1)$.

The logistic function or sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

[Figure: the sigmoid, rising from 0 to 1 with $\sigma(0) = 0.5$.]
$$g(x) = w^\top x + w_0$$

Notation: write the equation more compactly as $g(x) = \tilde{w}^\top \tilde{x}$ by appending a 1 to x. E.g. in 2D:

$$g(x) = \begin{pmatrix} w_1 & w_2 & w_0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ 1 \end{pmatrix}$$

Idea: fit the sigmoid

$$\sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$$

to the data $\{x_i, y_i\}$ by minimizing the classification errors $y_i - \sigma(w^\top x_i)$.
Maximum Likelihood Estimation

Assume $y_i \in \{0, 1\}$ with $p(y_i = 1 \mid x_i) = \sigma(w^\top x_i)$. The gradient of the log-likelihood is

$$\frac{\partial L(w)}{\partial w_j} = \sum_i \left( y_i - \sigma(w^\top x_i) \right) x_{ij}$$

which gives the (gradient ascent) update rule

$$w \leftarrow w + \eta \sum_i \left( y_i - \sigma(w^\top x_i) \right) x_i$$

Note:
- this is similar, but not identical, to the perceptron update rule.
- there is a unique solution for w.
- in n dimensions it is only necessary to learn n + 1 parameters. Compare this with learning normal distributions, where learning involves 2n parameters for the means and n(n+1)/2 for a common covariance matrix.
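A minimal sketch of this update rule as batch gradient ascent; the learning rate, iteration count and toy data are assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, iters=2000):
    """Maximise the log-likelihood by gradient ascent.
    X is n x d with a constant-1 column for the bias; y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)           # p(y_i = 1 | x_i)
        w += eta * X.T @ (y - p)     # dL/dw = sum_i (y_i - sigma(w.x_i)) x_i
    return w

# Toy 1D data with an appended bias column (illustrative values).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([x, np.ones_like(x)])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
print(w, sigmoid(X @ w).round(2))   # fitted probabilities increase with x
```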
Application: handwritten digit recognition

[Figure: hand-drawn digits 1–9 and 0, with the classification produced for each example.]
Comparison of discriminant and generative approaches

Discriminant
+ don't have to learn parameters which aren't used (e.g. covariance)
+ easy to learn
- no confidence measure
- have to retrain if the dimension of the feature vectors is changed

Generative
+ have a confidence measure
+ can use the reject option
+ easy to add independent measurements (see the sketch after this list):

$$p(C_k \mid x_A, x_B) \propto p(x_A, x_B \mid C_k)\, p(C_k) = p(x_A \mid C_k)\, p(x_B \mid C_k)\, p(C_k) \propto \frac{p(C_k \mid x_A)\, p(C_k \mid x_B)}{p(C_k)}$$

- expensive to train (because of the many parameters)
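A small numerical sketch of the combination formula above, with assumed priors and single-measurement posteriors:

```python
import numpy as np

# Assumed priors and single-measurement posteriors for two classes.
prior = np.array([0.5, 0.5])
post_A = np.array([0.8, 0.2])   # p(C_k | x_A)
post_B = np.array([0.7, 0.3])   # p(C_k | x_B)

# Combine independent measurements: p(Ck|xA,xB) ∝ p(Ck|xA) p(Ck|xB) / p(Ck).
combined = post_A * post_B / prior
combined /= combined.sum()
print(combined)   # [0.903, 0.097]: more confident than either measurement alone
```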
Perceptrons (1969)
Recent progress in Machine Learning
Perceptron (non-examinable)
For the perceptron,

$$g(x) = w^\top x \quad \text{where} \quad w = \sum_i^n \alpha_i x_i$$

so

$$g(x) = \sum_i \alpha_i\, x_i^\top x$$

Generalize to

$$g(x) = \sum_i \alpha_i\, \phi(x_i)^\top \phi(x)$$

where $\phi(x)$ is a map from x to a higher dimension. For example, for $x = (x_1, x_2)^\top$:

$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)^\top$$
Example

$$\phi(x_1, x_2) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)^\top$$

[Figure: data mapped into 3D feature space with axes $X = x_1^2$, $Y = x_2^2$, $Z = \sqrt{2}\, x_1 x_2$.]
Kernels

Generalize further to

$$g(x) = \sum_i \alpha_i\, K(x_i, x)$$

where $K(x, z)$ is a (non-linear) kernel function. For example,

$$K(x, z) = \exp\left\{ -\|x - z\|^2 / (2\sigma^2) \right\}$$

is a radial basis function kernel, and

$$K(x, z) = (x \cdot z)^n$$

is a polynomial kernel.
Exercise

If n = 2, show that $K(x, z) = (x \cdot z)^2 = \phi(x)^\top \phi(z)$, with $\phi$ the feature map given above.
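A quick numerical check of this identity (a sketch, with random test vectors):

```python
import numpy as np

def phi(x):
    """Feature map phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2)^T."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.dot(x, z) ** 2)   # polynomial kernel with n = 2
print(phi(x) @ phi(z))     # equal: the kernel computes the inner
                           # product in the mapped feature space
```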