Lecture 4
K-nearest neighbors
Nearest neighbor classifier
• Remember all the training data (non-parametric classifier)
• At test time, find the closest example in the training set, and return the corresponding label
\hat{y}(x) = y_{n^*}, \quad \text{where } n^* = \arg\min_{n \in \mathcal{D}} \mathrm{dist}(x, x_n)
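A minimal NumPy sketch of this rule (function and variable names are illustrative, not from the lecture):

import numpy as np

def nn_classify(x, X_train, y_train):
    # Return the label of the single nearest training point (1-NN).
    dists = np.linalg.norm(X_train - x, axis=1)  # distance from x to every stored example
    n_star = np.argmin(dists)                    # index of the closest example
    return y_train[n_star]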
K-nearest neighbor (kNN)
• We can find the K nearest neighbors and return the majority vote of their labels (sketched below)
• E.g., in the figure, ŷ(x1) = x and ŷ(x2) = o
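A sketch of the majority-vote rule, extending the 1-NN function above (names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=5):
    # Return the majority label among the K nearest training points.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]        # indices of the K closest examples
    votes = Counter(y_train[nearest])      # count the labels among those neighbors
    return votes.most_common(1)[0][0]      # majority vote (ties broken arbitrarily)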
Effect of K
• Larger K yields smoother predictions, since we average over more data
• K=1 yields a piecewise constant labeling
• K=N predicts the globally constant (majority) label
[Figure: kNN decision boundaries for K=1 vs. K=15; DHS Fig. 4.13]
Model selection
• Degrees of freedom ≈ N/K, since if neighborhoods did not overlap, there would be N/K neighborhoods, with one label (parameter) each
• K=1 yields zero training error, but badly overfits
[Figure: train and test error vs. model complexity (dof = N/K), from dof=5 at K=20 to dof=100 at K=1]
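As a concrete check of the dof ≈ N/K count (assuming N = 100 training points, which matches the dof values in the figure):

\mathrm{dof} \approx \frac{N}{K}: \qquad K=20 \;\Rightarrow\; \mathrm{dof} \approx \tfrac{100}{20} = 5, \qquad K=1 \;\Rightarrow\; \mathrm{dof} \approx \tfrac{100}{1} = 100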
Model selection
We can estimate the generalization error on a held-out validation set of size N_valid:

\mathbb{E}[\mathrm{err}] \approx \frac{1}{N_{\mathrm{valid}}} \sum_{n=1}^{N_{\mathrm{valid}}} \mathbb{I}\big(\hat{y}(x_n) \neq y_n\big)
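A minimal sketch of this estimate (names are illustrative):

import numpy as np

def validation_error(predict, X_valid, y_valid):
    # Fraction of validation points the classifier gets wrong.
    y_hat = np.array([predict(x) for x in X_valid])
    return np.mean(y_hat != y_valid)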
K-fold cross validation
If D is so small that the error on a single held-out validation set would be an unreliable estimate of the generalization error, we can repeatedly train on all but 1/K'th of the data and test on the remaining 1/K'th (as sketched below). Typically K=10.
If K=N, this is called leave-one-out CV: we train on N−1 points and test on the one left out.
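A sketch of K-fold CV for a generic classifier (the fold count is written n_folds here to avoid clashing with the K of kNN; all names are illustrative):

import numpy as np

def cross_val_error(fit_and_predict, X, y, n_folds=10):
    # Average test error over n_folds train/test splits.
    N = len(y)
    folds = np.array_split(np.random.permutation(N), n_folds)   # random partition of the indices
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), test_idx)         # train on all but this fold
        y_hat = fit_and_predict(X[train_idx], y[train_idx], X[test_idx])
        errs.append(np.mean(y_hat != y[test_idx]))
    return np.mean(errs)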
CV for kNN
• In hw1, you will implement CV and use it to select K for a kNN classifier
• Can use the “one standard error” rule*, where we pick the simplest model whose error is no more than 1 SE above the best (sketched below)
• For kNN, dof ≈ N/K, so the simplest such model is the one with the largest K; in the example shown we would pick K=11
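A sketch of the one-standard-error rule applied to the CV results for each candidate K (names are illustrative):

import numpy as np

def one_se_rule(candidate_Ks, fold_errors):
    # fold_errors[i] holds the per-fold CV errors for candidate_Ks[i].
    # Pick the largest K (simplest kNN model) whose mean error is within
    # one standard error of the best mean error.
    means = np.array([np.mean(e) for e in fold_errors])
    ses = np.array([np.std(e, ddof=1) / np.sqrt(len(e)) for e in fold_errors])
    best = np.argmin(means)
    threshold = means[best] + ses[best]
    return max(K for K, m in zip(candidate_Ks, means) if m <= threshold)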
[Figure: CV error vs. K; * HTF p. 216]
Application of kNN to pixel labeling
LANDSAT images for an agricultural area in 4 spectral bands;
manual labeling into 7 classes (red soil, cotton, vegetation, etc.);
Output of 5NN using each 3x3 pixel block in all 4 channels (9*4=36 dimensions).
This approach outperformed all other methods in the STATLOG project.
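A sketch of the feature construction described above (array names and shapes are assumptions):

import numpy as np

def block_features(img):
    # img: (H, W, 4) multi-spectral image. For each interior pixel, stack its
    # 3x3 neighborhood in all 4 channels into a 9*4 = 36-dimensional feature vector.
    H, W, C = img.shape
    feats = np.zeros((H - 2, W - 2, 9 * C))
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2, :]   # 3x3 block, all channels
            feats[i - 1, j - 1] = patch.ravel()
    return feats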
Not on exam
Reducing space requirements of kNN
• Various heuristic algorithms have been proposed to
prune/ edit/ condense “irrelevant” points that are far
from the decision boundaries
• Later we will study sparse kernel machines that
give a more principled solution to this problem
Not on exam
Similarity is hard to define
[Figure: several example images, each labeled “tufa”]
Euclidean distance
• For real-valued feature vectors, we can use Euclidean distance:

D(u, v)^2 = \|u - v\|^2 = (u - v)^T (u - v) = \sum_{i=1}^{d} (u_i - v_i)^2
HTF 13.9
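A vectorized sketch of the squared Euclidean distances from one query to all training points (names are illustrative):

import numpy as np

def sq_euclidean_dists(x, X_train):
    # Squared Euclidean distance from query x (shape (d,)) to every row of X_train (shape (N, d)).
    diff = X_train - x                 # broadcasting over the N rows
    return np.sum(diff * diff, axis=1)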
Problems with kNN
• Can be slow to find the nearest neighbor in a high-dimensional space: n^* = \arg\min_{n \in \mathcal{D}} \mathrm{dist}(x, x_n)
• Need to store all the training data, so takes a lot of
memory
• Need to specify the distance function
• Does not give probabilistic output
Why is probabilistic output useful?
• A classification function returns a single best guess given an input: ŷ(x, θ) ∈ Y
• A probabilistic classifier returns a probability distribution over outputs given an input: p(y|x, θ) ∈ [0, 1]
• If p(y|x) is near 0.5 (very uncertain), the system may choose not to classify as 0/1 and instead ask for human help (sketched below)
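A minimal sketch of such a reject option on top of a probabilistic classifier's output (the threshold value is an assumption; names are illustrative):

import numpy as np

def predict_or_reject(probs, threshold=0.7):
    # probs: array of class probabilities p(y|x).
    # Return the most probable class, or None ('ask a human') if too uncertain.
    best = np.argmax(probs)
    return best if probs[best] >= threshold else None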
[Figure: 2D training data from 3 classes, with a query point “?”; the kNN predictive distribution at the query is P = [3/4, 0, 1/4]]
[Figure: heatmaps of P(y=2|x, D) and P(y=3|x, D) for kNN with K=10]
Heatmap of p(y|x,D) for a 2D grid
[Figure: p(y=1|x,K=10,naive) shown as a heatmap, a 3D surface plot, and a contour plot]
Smoothing empirical frequencies
• The empirical distribution will often predict 0 probability, due to sparse data
• We can add pseudocounts to the data and then normalize (sketched below)
[Figure: bar plots of the empirical and smoothed class distributions for K=4 neighbors, C=3 classes, and counts X = [3 0 1]]
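A sketch of the pseudocount smoothing on the counts from the figure (the pseudocount of 1 is an assumed value):

import numpy as np

counts = np.array([3, 0, 1])           # votes among K=4 neighbors for C=3 classes

# Empirical (unsmoothed) estimate: assigns probability 0 to class 2.
p_empirical = counts / counts.sum()    # [0.75, 0.0, 0.25]

# Add a pseudocount to every class, then renormalize.
alpha = 1.0
p_smoothed = (counts + alpha) / (counts + alpha).sum()   # [4/7, 1/7, 2/7]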
Softened Probabilistic kNN
[Figure: training data (“train”) and the raw-count estimate p(y=1|x,K=10,naive)]
Softmax
[Figure: softmax-smoothed estimate p(y=1|x,K=10,unweighted,beta=1.0000)]
Local kernel function
• Gaussian kernel

K_\lambda(x_0, x) = \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left(-\frac{(x - x_0)^2}{2\lambda^2}\right)
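The pieces above can be combined into a softened probabilistic kNN estimate. The sketch below weights each neighbor's vote by the Gaussian kernel and adds a pseudocount; this is one reading of the slides (the softmax/beta variant in the figure is a different weighting), and all names and parameter values are illustrative:

import numpy as np

def gaussian_kernel(dists, lam=1.0):
    # K_lambda evaluated at the given distances |x - x0|.
    return np.exp(-dists**2 / (2 * lam**2)) / (np.sqrt(2 * np.pi) * lam)

def soft_knn_probs(x, X_train, y_train, C, K=10, lam=1.0):
    # Weight the K nearest neighbors' votes by a Gaussian kernel on their
    # distance to x, add a pseudocount, and normalize into p(y|x).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    weights = gaussian_kernel(dists[nearest], lam)
    counts = np.zeros(C)
    for n, w in zip(nearest, weights):
        counts[y_train[n]] += w          # assumes labels are coded 0..C-1
    counts += 1.0                        # pseudocount smoothing (assumed value)
    return counts / counts.sum()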