
CS340 Machine learning

Lecture 4
K-nearest neighbors
Nearest neighbor classifier
• Remember all the training data (non-parametric
classifier)
• At test time, find closest example in training set,
and return corresponding label
ŷ(x) = y_{n∗},  where  n∗ = argmin_{n ∈ D} dist(x, x_n)

K-nearest neighbor (kNN)
• We can find the K nearest neighbors, and return
the majority vote of their labels
• E.g., in the figure, ŷ(x1) = “x” and ŷ(x2) = “o”
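A minimal MATLAB sketch of this rule (the function and variable names here are illustrative; this is not the knnClassify routine used later in the slides):

function ypred = knnPredict(Xtrain, ytrain, Xtest, K)
% Majority-vote K-nearest-neighbor prediction (illustrative sketch).
% Xtrain: Ntrain x d, ytrain: Ntrain x 1 labels, Xtest: Ntest x d.
Ntest = size(Xtest, 1);
ypred = zeros(Ntest, 1);
for i = 1:Ntest
  diffs = Xtrain - repmat(Xtest(i,:), size(Xtrain,1), 1);
  dist2 = sum(diffs.^2, 2);              % squared Euclidean distances
  [~, idx] = sort(dist2, 'ascend');
  nbrLabels = ytrain(idx(1:K));          % labels of the K closest points
  ypred(i) = mode(nbrLabels);            % majority vote (ties -> smallest label)
end
end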
Effect of K
• Larger K yields smoother predictions, since we average
over more data
• K=1 yields a piecewise constant labeling
• K=N predicts the globally constant (majority) label
[Figure: kNN decision boundaries for K=1 and K=15; Fig 2.2, 2.3 of HTF01]


Decision boundary for K=1
• The decision boundary is piecewise linear; each piece is a
segment of the perpendicular bisector of a pair of points from
different classes (Voronoi tessellation)

DHS 4.13
Model selection
• Degrees of freedom ≈ N/K, since if the neighborhoods did not
overlap, there would be N/K neighborhoods, each with one
label (parameter)
• K=1 yields zero training error, but badly overfits
[Figure: training and test error vs. degrees of freedom, from dof=5 (K=20) to dof=100 (K=1)]
Model selection

• If we use empirical error to choose H (models), we
will always pick the most complex model
Approaches to model selection
• We can choose the model which optimizes the fit to
the training data minus a complexity penalty

Ĥ = argmax_H [ fit(H|D) − λ complexity(H) ]
• Complexity can be measured in various ways
– Parameter counting
– VC dimension
– Information-theoretic encoding length
• We will see some examples later in class
Validation data
• Alternatively, we can estimate performance of each
model on a validation set (not used to fit the model)
and use this to select the right H.
• This is an estimate of the generalization error.
• Once we have chosen the model, we refit it to all
the data, and report performance on a test set.

E[err] ≈ (1/N_valid) Σ_{n=1}^{N_valid} I(ŷ(x_n) ≠ y_n)
K-fold cross validation
If D is so small that a single held-out validation set would give
an unreliable estimate of the generalization error, we can
repeatedly train on all but 1/K of the data and test on the
remaining 1/K'th. Typically K=10.
If K=N, this is called leave-one-out cross validation (LOO-CV).
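As a rough sketch (reusing the illustrative knnPredict function above, and assuming the training data are in Xtrain, ytrain as in the code later in these slides; the candidate values and fold assignment are arbitrary), 10-fold CV for choosing the number of neighbors might look like this:

% Illustrative 10-fold CV over candidate neighborhood sizes.
Kvals  = 1:2:25;                          % candidate numbers of neighbors
nFolds = 10;
N      = size(Xtrain, 1);
fold   = mod(randperm(N), nFolds) + 1;    % random fold id (1..nFolds) per point
cvErr  = zeros(numel(Kvals), 1);
for ki = 1:numel(Kvals)
  errs = zeros(nFolds, 1);
  for f = 1:nFolds
    te = (fold == f);  tr = ~te;          % train on all but fold f, test on fold f
    yhat = knnPredict(Xtrain(tr,:), ytrain(tr), Xtrain(te,:), Kvals(ki));
    errs(f) = mean(yhat ~= ytrain(te));
  end
  cvErr(ki) = mean(errs);                 % CV estimate of the error for this K
end
[~, best] = min(cvErr);
bestK = Kvals(best)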
CV for kNN
• In hw1, you will implement CV and use it to select K
for a kNN classifier
• Can use the “one standard error” rule*, where we
pick the simplest model whose error is no more
than 1 se above the best.
• For KNN, dof=N/K, so we would pick K=11.

[Figure: CV error vs. K]
* HTF p216
Application of kNN to pixel labeling
LANDSAT images of an agricultural area in 4 spectral bands;
manual labeling into 7 classes (red soil, cotton, vegetation, etc.);
output of 5NN using each 3x3 pixel block in all 4 channels (9*4 = 36 dimensions) as the feature vector.
This approach outperformed all other methods in the STATLOG project.

HTF fig 13.6, 13.7


Problems with kNN
• Can be slow to find nearest nbr in high dim space
n∗ = argmin_{n ∈ D} dist(x, x_n)
• Need to store all the training data, so takes a lot of
memory
• Need to specify the distance function
• Does not give probabilistic output
Reducing run-time of kNN
• Takes O(Nd) to find the exact nearest neighbor
• Use a branch and bound technique where we prune points
based on their partial distances (see the sketch below)

D_r(a, b)² = Σ_{i=1}^{r} (a_i − b_i)²

• Structure the points hierarchically into a kd-tree
(does offline computation to save online computation)
• Use locality sensitive hashing (a randomized
algorithm)

Not on exam
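For concreteness, here is one way the partial-distance idea could be sketched in MATLAB (an illustration, not the course code): stop summing squared coordinate differences as soon as the running total already exceeds the best distance found so far.

function [bestIdx, bestDist2] = nnPartialDistance(Xtrain, x)
% 1-NN search with partial-distance pruning (illustrative sketch).
[N, d] = size(Xtrain);
bestDist2 = inf;  bestIdx = 0;
for n = 1:N
  s = 0;
  for i = 1:d
    s = s + (Xtrain(n,i) - x(i))^2;   % partial distance over the first i dims
    if s >= bestDist2                 % already worse than the best: prune
      break
    end
  end
  if s < bestDist2
    bestDist2 = s;  bestIdx = n;      % new nearest neighbor so far
  end
end
end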
Reducing space requirements of kNN
• Various heuristic algorithms have been proposed to
prune/edit/condense “irrelevant” points that are far
from the decision boundaries
• Later we will study sparse kernel machines that
give a more principled solution to this problem

Not on exam
Similarity is hard to define
[Figure: three example objects, each labeled “tufa”]
Euclidean distance
• For real-valued feature vectors, we can use Euclidean distance

D(u, v)² = ||u − v||² = (u − v)ᵀ(u − v) = Σ_{i=1}^{d} (u_i − v_i)²

• If we scale feature x1 by 1/3, the nearest neighbor changes!


Mahalanobis distance
• Mahalanobis distance lets us put different weights
on different comparisons

D(u, v)² = (u − v)ᵀ Σ (u − v) = Σ_i Σ_j (u_i − v_i) Σ_ij (u_j − v_j)

where Σ is a symmetric positive definite matrix


• Euclidean distance is the special case Σ = I
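A tiny sketch in MATLAB (the weight matrix Sigma below is made up purely for illustration):

% Squared weighted (Mahalanobis-style) distance vs. Euclidean (illustrative).
u = [1; 2];  v = [2; 0];
Sigma = [2 0; 0 0.5];                  % e.g. weight dimension 1 more heavily (made-up values)
d2_weighted  = (u - v)' * Sigma * (u - v)
d2_euclidean = (u - v)' * (u - v)      % the special case Sigma = eye(2)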
Error rates on USPS digit recognition
• 7291 train, 2007 test
• Neural net: 0.049
• 1-NN/Euclidean distance: 0.055
• 1-NN/tangent distance: 0.026
• In practice the neural net is preferred, since kNN is too slow
at test time (lazy learning)

HTF 13.9
Problems with kNN
• Can be slow to find nearest nbr in high dim space
n∗ = argmin_{n ∈ D} dist(x, x_n)
• Need to store all the training data, so takes a lot of
memory
• Need to specify the distance function
• Does not give probabilistic output
Why is probabilistic output useful?
• A classification function returns a single best guess
given an input: ŷ(x, θ) ∈ Y
• A probabilistic classifier returns a probability
distribution over outputs given an input: p(y|x, θ) ∈ [0, 1]
• If p(y|x) is near 0.5 (very uncertain), the system
may choose not to classify as 0/1 and instead ask
for human help

• If we want to combine different predictions p(y|x),
we need a measure of confidence
• p(y|x) lets us use likelihood as a measure of fit
Probabilistic kNN
• We can compute the empirical distribution over
labels in the K-neighborhood
• However, this will often predict 0 probability due to
sparse data
p(y|x, D) = (1/K) Σ_{j ∈ nbr(x,K,D)} I(y = y_j)

Example: K=4, C=3 classes; neighbor counts [3, 0, 1] give P = [3/4, 0, 1/4] over y = 1, 2, 3.
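A sketch of this in MATLAB (names are illustrative; labels are assumed to be 1..C):

function p = knnProb(Xtrain, ytrain, x, K, C)
% Empirical class distribution over the K nearest neighbors of x (sketch).
diffs = Xtrain - repmat(x(:)', size(Xtrain,1), 1);
[~, idx] = sort(sum(diffs.^2, 2), 'ascend');
nbrLabels = ytrain(idx(1:K));
counts = accumarray(nbrLabels(:), 1, [C 1]);   % count of each class among the K
p = counts' / K;                               % e.g. counts [3 0 1], K=4 -> [3/4 0 1/4]
end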


Probabilistic kNN
[Figure: 2D training data from 3 classes, with heatmaps of p(y=1|x,K=10,naive), p(y=2|x,K=10,naive), and p(y=3|x,K=10,naive) over the plane]
Heatmap of p(y|x,D) for a 2D grid
[Figure: heatmap of p(y=1|x,K=10,naive) over the grid, produced by the code below]

% Evaluate the kNN class probabilities on a dense 2D grid and show p(y=1|x) as a heatmap
xrange = -4.5:0.1:6.25; yrange = -3.85:0.1:8.25;
[X, Y] = meshgrid(xrange, yrange); XtestGrid = [X(:) Y(:)];     % grid points as rows
%[XtestGrid, xrange, yrange] = makeGrid2d(Xtrain, 0.4);
[ypredGrid, yprobGrid] = knnClassify(Xtrain, ytrain, XtestGrid, K);
HH = reshape(yprobGrid(:,1), [length(yrange) length(xrange)]);  % p(y=1|x) over the grid
figure(3); clf
imagesc(HH); axis xy; colorbar
imagesc, bar3, surf, contour
[Figure: p(y=1|x,K=10,naive) visualized as a heatmap (imagesc), 3D bar plot (bar3), surface (surf), and contour plot (contour)]
Smoothing empirical frequencies
• The empirical distribution will often predict 0
probability due to sparse data
• We can add pseudo counts to the data and then
normalize

Example: K=4, C=3; adding a pseudo count of 1 to each class gives
P = [3+1, 0+1, 1+1] / 7 = [4/7, 1/7, 2/7] over y = 1, 2, 3.
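In MATLAB this is just (illustrative):

counts = [3 0 1];                 % neighbor counts, K = 4, C = 3
alpha  = 1;                       % one pseudo count per class
p = (counts + alpha) / (sum(counts) + alpha*numel(counts))   % = [4/7 1/7 2/7]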


Softmax (multinomial logit) function
• We can “soften” the empirical distribution so it
spreads its probability mass over unseen classes
• Define the softmax with inverse temperature β
S(x, β)_i = exp(β x_i) / Σ_j exp(β x_j)
• Big beta = cool temp = spiky distribution
• Small beta = high temp = uniform distribution
[Figure: softmax of x = [3 0 1] at β=100 (spiky), β=1 (intermediate), and β=0.01 (nearly uniform), over y = 1, 2, 3]
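A one-line MATLAB sketch reproducing the behaviour above (for very large β·x one would normally subtract max(x) before exponentiating, for numerical stability):

smax = @(x, beta) exp(beta*x) ./ sum(exp(beta*x));   % softmax with inverse temperature beta
x = [3 0 1];
smax(x, 100)     % spiky: nearly all mass on the largest entry
smax(x, 1)       % intermediate
smax(x, 0.01)    % nearly uniform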
Softened Probabilistic kNN
[Figure: training data, heatmap of p(y=1|x,K=10,naive) from the raw counts, and heatmap of the softmax-smoothed p(y=1|x,K=10,unweighted,beta=1.0000)]
Softmax over the counts (the sums run over the K nearest neighbors, written j ∼ x):

p(y|x, D, K, β) = exp[(β/K) Σ_{j∼x} I(y = y_j)] / Σ_{y′} exp[(β/K) Σ_{j∼x} I(y′ = y_j)]

Weighted Probabilistic kNN
[Figure: training data and heatmap of p(y=1|x,K=10,weighted,beta=1.0000)]
Softmax over a weighted sum over the K nearest neighbors, where w(x, x_j) is a local kernel function:

p(y|x, D, K, β) = exp[(β/K) Σ_{j∼x} w(x, x_j) I(y = y_j)] / Σ_{y′} exp[(β/K) Σ_{j∼x} w(x, x_j) I(y′ = y_j)]
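A sketch of the weighted version in MATLAB (illustrative; here w(x, x_j) is taken to be a Gaussian kernel, one of the kernels defined on the next slide, and labels are assumed to be 1..C):

function p = weightedKnnProb(Xtrain, ytrain, x, K, C, beta, lambda)
% Kernel-weighted softmax kNN class probabilities (illustrative sketch).
dist2 = sum((Xtrain - repmat(x(:)', size(Xtrain,1), 1)).^2, 2);
[~, idx] = sort(dist2, 'ascend');
nbr = idx(1:K);                                % indices of the K nearest points
w = exp(-dist2(nbr) / (2*lambda^2));           % Gaussian kernel weights w(x, xj)
s = zeros(1, C);
for c = 1:C
  s(c) = sum(w .* (ytrain(nbr) == c));         % weighted count for class c
end
p = exp((beta/K) * s);
p = p / sum(p);                                % softmax normalization
end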


Kernel functions
Any smooth function K such that
K(x) ≥ 0,   ∫ K(x) dx = 1,   ∫ x K(x) dx = 0,   ∫ x² K(x) dx > 0
• Epanechnikov quadratic kernel
  K_λ(x₀, x) = D(|x − x₀| / λ),   D(t) = (3/4)(1 − t²) if |t| ≤ 1, 0 otherwise   (λ = bandwidth)
• Tri-cube kernel
  K_λ(x₀, x) = D(|x − x₀| / λ),   D(t) = (1 − |t|³)³ if |t| ≤ 1, 0 otherwise
• Gaussian kernel
  K_λ(x₀, x) = (1 / (√(2π) λ)) exp(−(x − x₀)² / (2λ²))
Kernel characteristics: compact support – vanishes beyond a finite range (Epanechnikov, tri-cube); everywhere differentiable (Gaussian, tri-cube)
HTF 6.2
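The three kernels above, written as short MATLAB anonymous functions of t = |x − x0| / λ (a sketch following the formulas on this slide):

% Kernel functions (illustrative), t = abs(x - x0) / lambda.
epan    = @(t) (3/4) * (1 - t.^2) .* (abs(t) <= 1);      % Epanechnikov
tricube = @(t) (1 - abs(t).^3).^3 .* (abs(t) <= 1);      % tri-cube
gauss   = @(x0, x, lambda) exp(-(x - x0).^2 / (2*lambda^2)) / (sqrt(2*pi)*lambda);

% Example: weights for neighbors xj of a query x0 with bandwidth lambda = 2
lambda = 2;  x0 = 0;  xj = [-3 -1 0 0.5 2.5];
w = epan(abs(xj - x0) / lambda)      % zero outside the bandwidth, largest at xj = x0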
Kernel functions on structured objects
• Rather than defining a feature vector x and computing
Euclidean distance D(x, x′), sometimes we can directly
compute the distance between two structured objects
• E.g., string/graph matching using dynamic programming
(see the sketch below)
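For example, the edit (Levenshtein) distance between two strings can be computed with a standard dynamic program (a generic sketch, not course code):

function d = editDistance(a, b)
% Edit distance between strings a and b via dynamic programming (sketch).
na = numel(a);  nb = numel(b);
D = zeros(na+1, nb+1);
D(:,1) = 0:na;                        % cost of deleting all of a
D(1,:) = 0:nb;                        % cost of inserting all of b
for i = 1:na
  for j = 1:nb
    cost = double(a(i) ~= b(j));      % 0 if the characters match, else 1
    D(i+1,j+1) = min([D(i,j+1) + 1, ...    % deletion
                      D(i+1,j) + 1, ...    % insertion
                      D(i,j) + cost]);     % match / substitution
  end
end
d = D(na+1, nb+1);
end

For instance, editDistance('kitten', 'sitting') returns 3.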
