B. B. Misra
Eager: must commit to a single hypothesis that covers the entire instance space
Cons
- Cost of classification can be high
- Uses all attributes (does not learn which are most important)
k-NN learning
Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Example:
- height of a person may vary from 1.5 m to 1.8 m
- weight of a person may vary from 90 lb to 300 lb
- income of a person may vary from $10K to $1M
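One common way to address this is min-max scaling, which maps every attribute to the range [0, 1] before distances are computed. The sketch below is only an illustration under that assumption; the function name `min_max_scale` and the three sample people are hypothetical, with values chosen from the ranges in the example above.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical people: (height in m, weight in lb, income in $)
people = [
    (1.5, 90, 10_000),
    (1.8, 300, 1_000_000),
    (1.65, 195, 505_000),
]
scaled_people = min_max_scale(people)
```

Without scaling, the income differences (up to 990,000) would swamp the height differences (at most 0.3) in any distance computation. After scaling, each attribute contributes on the same [0, 1] range.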
Important Decisions
- Distance measure
- Value of k (usually odd)
- Voting mechanism
- Memory indexing
Euclidean Distance
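Typically used for real-valued attributes.
An instance x (often called a feature vector) is described by its attribute values:
\[ \langle a_1(x), a_2(x), \ldots, a_n(x) \rangle \]
The distance between two instances x_i and x_j is:
\[ d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2} \]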
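As a minimal sketch, the Euclidean distance between two feature vectors can be written directly from its definition (square root of the sum of squared attribute differences). The function name `euclidean_distance` is an illustrative choice, and instances are assumed to be plain sequences of numbers of equal length.

```python
import math


def euclidean_distance(x_i, x_j):
    """Euclidean distance between two equal-length feature vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("instances must have the same number of attributes")
    # Sum the squared differences attribute by attribute, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


d = euclidean_distance((0, 0), (3, 4))  # 5.0
```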
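For a real-valued target function f, the prediction for a query instance x_q is the mean of f over its k nearest neighbors x_1, ..., x_k:
\[ \hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i) \]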
Using the training data, classify the test data for 1-NN, 3-NN, 5-NN and 7-NN.

Training data
Number  Lines  Line types  Rectangles  Colors  Mondrian?
1       6      1           10          4       No
2       4      2           8           5       No
3       5      2           7           4       Yes
4       5      1           8           4       Yes
5       5      1           10          5       No
6       6      1           8           6       Yes
7       7      1           14          5       No

Test data
Number  Lines  Line types  Rectangles  Colors  Mondrian?
8       7      2           9           4       ?

Distance of the 1st record of the training data from the test data is
\[ d_1 = \sqrt{(6-7)^2 + (1-2)^2 + (10-9)^2 + (4-4)^2} = \sqrt{3} = 1.732 \]

Number  Mondrian?  Distance from test data
1       No         1.732
2       No         3.317
3       Yes        2.828
4       Yes        2.449
5       No         2.646
6       Yes        2.646
7       No         5.196
For 1-NN, the 1st record (class = No) is the closest, i.e. the 1st neighbor, hence it classifies No.
For 3-NN, 1st (No), 4th (Yes), and {5th (No) & 6th (Yes), both at equal distance} are the neighbors. A tie occurs for the third neighbor; to break it, use some mechanism, e.g. choose one at random or give first priority to the 1st neighbor, and it classifies No.
For 5-NN, 1st (No), 4th (Yes), 5th (No), 6th (Yes), and 3rd (Yes) are the neighbors in order. Hence it classifies Yes.
For 7-NN, there are only 7 records in the training set, so all are considered. There are 4 No and 3 Yes classes, hence it classifies No.
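The worked example can be checked with a short script. This is a sketch, not the slides' own procedure: the names `TRAINING`, `TEST`, `squared_distance` and `knn_classify` are illustrative, and the tie between the 5th and 6th records is broken by training-set order (an assumption made here so the result is deterministic; the slides suggest random choice or priority to the 1st neighbor instead).

```python
import math
from collections import Counter

# (lines, line types, rectangles, colors) -> Mondrian?
TRAINING = [
    ((6, 1, 10, 4), "No"),
    ((4, 2, 8, 5), "No"),
    ((5, 2, 7, 4), "Yes"),
    ((5, 1, 8, 4), "Yes"),
    ((5, 1, 10, 5), "No"),
    ((6, 1, 8, 6), "Yes"),
    ((7, 1, 14, 5), "No"),
]
TEST = (7, 2, 9, 4)


def squared_distance(a, b):
    """Squared Euclidean distance; exact for integers, so ties compare equal."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_classify(training, query, k):
    """Majority vote among the k training records closest to `query`."""
    # sorted() is stable: records at equal distance keep their training order,
    # so record 5 is ranked ahead of record 6 (assumed tie-breaking rule).
    ranked = sorted(training, key=lambda rec: squared_distance(rec[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


distances = [round(math.sqrt(squared_distance(x, TEST)), 3) for x, _ in TRAINING]
results = {k: knn_classify(TRAINING, TEST, k) for k in (1, 3, 5, 7)}
# results == {1: 'No', 3: 'No', 5: 'Yes', 7: 'No'}
```

Ranking by squared distance gives the same neighbor order as ranking by distance, because the square root is monotonic; the root is only taken when reporting the values.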
Algorithm questions
- What is the space complexity for the model?
- What is the time complexity for learning the model?
- What is the time complexity for classification of an instance?
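One way to reason about these questions is to look at a plain, unindexed implementation. The sketch below assumes n stored instances with d numeric attributes each and no memory indexing; the class name `BruteForceKNN` and its methods are hypothetical, and an indexed variant would have different costs.

```python
import heapq
from collections import Counter


class BruteForceKNN:
    """Unindexed k-NN: all work is deferred to classification time."""

    def __init__(self, k):
        self.k = k
        self.examples = []

    def fit(self, instances, labels):
        # "Learning" only stores the data: O(n * d) memory, and no model
        # is built beyond copying the n training instances.
        self.examples = list(zip(instances, labels))

    def predict(self, query):
        def sq_dist(example):
            instance, _ = example
            return sum((a - b) ** 2 for a, b in zip(instance, query))

        # One O(d) distance per stored instance gives O(n * d) per query;
        # selecting the k smallest adds an O(n log k) term.
        nearest = heapq.nsmallest(self.k, self.examples, key=sq_dist)
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


model = BruteForceKNN(k=3)
model.fit([(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)], ["a", "a", "a", "b", "b"])
near_origin = model.predict((0.2, 0.2))  # 'a'
near_far_cluster = model.predict((5.5, 5))  # 'b'
```

Under these assumptions the cost sits almost entirely in classification, which is why memory indexing appears among the important decisions.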
Challenges
- Finding a good similarity metric
- Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases