Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University
Date: 12.3.2008
Today's Agenda
(Diagram: a query instance is classified by comparison with what is present in memory.)

                 Instance-based learning     Previous learning
In Memory        Training Instances          Model / Hypothesis
The Euclidean distance between points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), in Euclidean n-space, is defined as:

$$\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

One dimension: $\sqrt{(p_1 - q_1)^2} = |p_1 - q_1|$

Two dimensions: $\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$
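A minimal sketch of this formula in Python; the function name euclidean_distance and the example points are my own, not from the slides:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# One dimension reduces to the absolute difference: |3 - 7| = 4
print(euclidean_distance([3], [7]))        # 4.0
# Two dimensions: sqrt((0-3)^2 + (0-4)^2) = 5
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```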
Manhattan Distance
The Manhattan distance between points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn), in n-space, is defined as:

$$|p_1 - q_1| + |p_2 - q_2| + \cdots + |p_n - q_n|$$
Manhattan Distance
Manhattan distance versus Euclidean distance: The red, blue, and yellow lines have the same length (12) in both Euclidean and taxicab geometry. In Euclidean geometry, the green line has length 6√2 ≈ 8.48, and is the unique shortest path. In taxicab geometry, the green line's length is still 12, making it no shorter than any other path shown.
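The contrast in the figure can be checked numerically; a small sketch (the function name manhattan_distance and the 6 × 6 grid corners are illustrative choices of mine):

```python
import math

def manhattan_distance(p, q):
    """Manhattan (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Opposite corners of a 6 x 6 grid, as in the figure above.
p, q = (0, 0), (6, 6)
print(manhattan_distance(p, q))                 # 12
print(math.sqrt((6 - 0) ** 2 + (6 - 0) ** 2))   # 8.485... = 6 * sqrt(2)
```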
Mahalanobis Distance
Prasanta Chandra Mahalanobis (June 29, 1893 - June 28, 1972) was an Indian scientist and applied statistician.
P.C. Mahalanobis
Mahalanobis Distance
The Mahalanobis distance of a multivariate vector X = (x1, x2, ..., xn) from a group with mean μ = (μ1, μ2, ..., μn) and covariance matrix Σ is defined as:

$$D_M(X) = \sqrt{(X - \mu)^{T} \, \Sigma^{-1} \, (X - \mu)}$$

Covariance Matrix

$$\Sigma_{i,j} = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]$$
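A hedged sketch of the two formulas above using NumPy; the helper name mahalanobis_distance and the toy samples are mine, not from the slides:

```python
import numpy as np

def mahalanobis_distance(x, mu, cov):
    """D_M = sqrt((x - mu)^T  Sigma^{-1}  (x - mu))."""
    diff = np.asarray(x) - np.asarray(mu)
    inv_cov = np.linalg.inv(cov)
    return float(np.sqrt(diff @ inv_cov @ diff))

# Toy data: estimate mean and covariance from samples, then measure a query point.
samples = np.array([[1.0, 2.0], [2.0, 2.5], [3.0, 4.0], [4.0, 4.5], [5.0, 6.0]])
mu = samples.mean(axis=0)
cov = np.cov(samples, rowvar=False)   # Sigma_{i,j} = E[(x_i - mu_i)(x_j - mu_j)]
print(mahalanobis_distance([2.5, 3.0], mu, cov))
```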
K-nearest Neighbor Learning
In nearest-neighbor learning the target function may be either discrete-
valued or real-valued.
(Figure: xq is classified as positive.)
5-nearest Neighbor Learning
(Figure: xq is classified as negative.)
K-nearest Neighbor Learning
The k-NEAREST NEIGHBOR algorithm is easily adapted to
approximating continuous-valued target functions.
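As a concrete illustration, here is a minimal sketch of the discrete-valued case with a Euclidean distance metric; the function name knn_classify and the toy training points are assumptions of mine, not from the slides:

```python
import math
from collections import Counter

def knn_classify(query, examples, k):
    """Classify `query` by majority vote among its k nearest training examples.
    `examples` is a list of (point, label) pairs."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1, 1), '+'), ((1, 2), '+'), ((2, 1), '+'),
            ((5, 5), '-'), ((6, 5), '-'), ((5, 6), '-')]
print(knn_classify((2, 2), training, k=3))  # '+'
print(knn_classify((5, 4), training, k=3))  # '-'
```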
Distance-Weighted Nearest Neighbor Learning
One refinement to the k-NEAREST NEIGHBOR algorithm is to weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors.
Distance-Weighted Nearest Neighbor Learning
For a discrete-valued target function, the output is a distance-weighted vote: each of the k neighbors votes for its class with a weight that increases as its distance to xq decreases.
For a continuous-valued target function, the output is the distance-weighted average of the k neighbors' target values.
Distance-Weighted Nearest Neighbor Learning
The variants of the k-NEAREST NEIGHBOR algorithm considered so far use only the k nearest neighbors to classify the query point. Once we add distance weighting, there is really no harm in allowing all training examples to have an influence on the classification of xq, because very distant examples will have very little effect on f(xq).
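To make the weighting concrete, here is a minimal sketch of distance-weighted classification for a discrete-valued target. The inverse-square-distance weight w = 1/d² and the helper name weighted_knn_classify are my own choices for illustration; the slides only require that closer neighbors receive larger weights.

```python
import math
from collections import defaultdict

def weighted_knn_classify(query, examples, k):
    """Distance-weighted k-NN vote for a discrete-valued target."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = dist(point, query)
        if d == 0:                  # query coincides with a training example
            return label
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)

training = [((1, 1), '+'), ((1, 2), '+'), ((6, 5), '-'), ((5, 6), '-')]
print(weighted_knn_classify((2, 2), training, k=4))  # '+'
```

For a continuous-valued target the analogous estimate is the weighted average of the neighbors' values, divided by the sum of the weights.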
Consider, for example, instances described by 20 attributes, of which only 2 are relevant to the target function. In this case, instances that have identical values for the 2 relevant attributes may nevertheless be distant from one another in the 20-dimensional instance space. As a result, the similarity metric used by k-NEAREST NEIGHBOR, which depends on all 20 attributes, will be misleading.
k-NN is slow for large training datasets, because the entire dataset must be searched to make a decision. To avoid this, efficient memory indexing must be used.
Various methods have been developed for indexing the stored training examples so that the nearest neighbors can be identified more efficiently, at some additional cost in memory. One such indexing method is the kd-tree (Bentley 1975; Friedman et al. 1977), in which instances are stored at the leaves of a tree, with nearby instances stored at the same or nearby nodes. The internal nodes of the tree sort the new query xq to the relevant leaf by testing selected attributes of xq.
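One way to see the indexing idea in practice is a hedged sketch using SciPy's cKDTree; the choice of SciPy and the random 3-D points are mine, the slides only cite the kd-tree papers:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))       # 10,000 stored training instances in 3-D

tree = cKDTree(points)                 # build the index once
query = np.array([0.5, 0.5, 0.5])
distances, indices = tree.query(query, k=5)   # 5 nearest neighbours of xq
print(indices, distances)
```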
Measuring Credibility
Confusion Matrix

                             Predicted Class
                             Class1                   Class2
Actual Class   Class1        True Positive (TP)       False Negative (FN)
               Class2        False Positive (FP)      True Negative (TN)
Sensitivity = TP / (TP+FN)
Specificity = TN / (TN+FP)
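These measures follow directly from the four counts in the confusion matrix. A small sketch; the example counts are invented, and precision is included on the assumption that it is defined in the usual way (TP / (TP + FP)):

```python
def credibility_measures(tp, fn, fp, tn):
    """Basic measures derived from a 2-class confusion matrix."""
    sensitivity = tp / (tp + fn)      # also called recall
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, precision, accuracy

print(credibility_measures(tp=40, fn=10, fp=5, tn=45))
# (0.8, 0.9, 0.888..., 0.85)
```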
Recall/Precision
Cost Sensitive Cases
Other Important Measures
Measuring Performance
(Diagram: the complete dataset is split into a training dataset and a test dataset.)
4-Fold Cross-validation
(Diagram: the dataset is divided into 4 folds. In each of the 4 rounds, one fold is held out as the test dataset and the remaining three folds form the training dataset, giving accuracies ACC1, ACC2, ACC3, and ACC4.)
4-Fold Cross-validation
(Diagram: an example dataset of n = 20 instances, numbered 1 to 20.)
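A minimal sketch of generating the folds, assuming n is divisible by k; the helper name kfold_indices is mine, and training/evaluation of a model are omitted:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (assumes n divisible by k)."""
    fold_size = n // k
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
    for test_idx in folds:
        train_idx = [j for j in range(n) if j not in test_idx]
        yield train_idx, test_idx

# n = 20, k = 4: each round holds out one fold of 5 instances as the test dataset.
for round_no, (train, test) in enumerate(kfold_indices(20, 4), start=1):
    print(f"round {round_no}: test fold = {test}")
```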
LOOCV (Leave-one-out cross-validation)
(Diagram: with n = 20 instances, a single instance, here instance 20, is held out as the test dataset and the remaining 19 instances form the training dataset; the procedure is repeated so that every instance is left out once.)
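LOOCV is the extreme case where each fold contains a single instance; a small standalone sketch for n = 20 (variable names are mine):

```python
n = 20
# Leave-one-out: each instance serves as the test dataset exactly once;
# the remaining n - 1 instances form the training dataset.
splits = [([j for j in range(n) if j != i], [i]) for i in range(n)]
print(len(splits))          # 20 rounds
train, test = splits[-1]
print(len(train), test)     # 19 [19]
```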
Bootstrap Method
(Diagram: from the complete dataset of size n, a training dataset of size n is drawn by sampling with replacement; the instances that are never selected form the test dataset, of size n1.)
Bootstrap Method
Probability that a sample is not selected for the training dataset:

$$\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368 \quad \text{(for large } n\text{)}$$
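This probability can be checked with a quick simulation; the sketch below draws one bootstrap training dataset and counts the never-selected instances (variable names are mine):

```python
import random

n = 10_000
# Draw a bootstrap training dataset of size n by sampling indices with replacement.
chosen = {random.randrange(n) for _ in range(n)}
out_of_bag = n - len(chosen)        # instances never selected -> test dataset
print(out_of_bag / n)               # roughly 0.368
print((1 - 1 / n) ** n)             # analytic value, approx 1/e
```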
Error Rate