
k-Nearest Neighbor: Instance-Based Learning

B. B. Misra

A classification of learning algorithms


Eager learning algorithms:
- Neural networks
- Decision trees
- Bayesian classifiers

Lazy learning algorithms:
- k-nearest neighbor
- Case-based reasoning

Lazy vs. Eager Learning


- Lazy learning (e.g., instance-based learning): simply stores the training data and waits until it is given a test tuple; it does not build a model explicitly.
- Eager learning: given a training set, constructs a classification model before receiving new (e.g., test) data to classify.
- Time: lazy methods spend less time in training but more time in predicting.
- Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function; an eager method must commit to a single hypothesis that covers the entire instance space.

General Idea of Instance-based Learning


- Learning: store all the training data instances.
- Performance: when a new query instance is encountered,
  - retrieve a set of similar, related instances from memory
  - use them to classify the new query

Pros and Cons of Instance Based Learning


Pros:
- Can construct a different approximation to the target function for each distinct query instance to be classified.
- Can use more complex, symbolic representations.

Cons:
- Cost of classification can be high.
- Uses all attributes (does not learn which are most important).

Example: 1-Nearest Neighbor

Example: 3-Nearest Neighbor

k-Nearest Neighbor (kNN) Learning


- The most basic type of instance-based learning.
- Assumes all instances are points in n-dimensional space.
- A distance measure is needed to determine the closeness of instances.
- Classify an instance by finding its nearest neighbors and picking the most popular class among them.
- Robust to noisy data: average over the k nearest neighbors rather than relying on a single one.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes (a sketch of axis weighting follows below).
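To illustrate the axis-stretching idea, here is a minimal sketch (the attribute values, weights, and function name are illustrative, not from the slides) of a weighted Euclidean distance in which a weight above 1 stretches an axis and a weight of 0 eliminates an irrelevant attribute:

```python
import math

def weighted_distance(x, y, weights):
    """Euclidean distance with per-attribute weights.

    A weight > 1 stretches an axis (makes the attribute more influential);
    a weight of 0 eliminates the attribute from the comparison.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Hypothetical 3-attribute instances; the third attribute is irrelevant,
# so its weight is set to 0 to remove its influence on the distance.
x = [1.0, 2.0, 500.0]
y = [1.5, 1.0, -300.0]
print(weighted_distance(x, y, [1.0, 1.0, 0.0]))  # 1.118...
```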

kNN Learning
Scaling issues:
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example:
  - height of a person may vary from 1.5 m to 1.8 m
  - weight of a person may vary from 90 lb to 300 lb
  - income of a person may vary from $10K to $1M
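As an illustration, here is a minimal min-max scaling sketch using the ranges quoted above (the specific person record and the function name are made up for the example):

```python
def min_max_scale(value, lo, hi):
    """Rescale a value to [0, 1] given the attribute's observed range."""
    return (value - lo) / (hi - lo)

# Attribute ranges from the example: height 1.5-1.8 m, weight 90-300 lb,
# income $10K-$1M. Without scaling, income differences dominate the distance.
person = {"height": 1.7, "weight": 150, "income": 250_000}
scaled = {
    "height": min_max_scale(person["height"], 1.5, 1.8),
    "weight": min_max_scale(person["weight"], 90, 300),
    "income": min_max_scale(person["income"], 10_000, 1_000_000),
}
print(scaled)  # all three attributes now lie in [0, 1]
```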

Important Decisions
- Distance measure
- Value of k (usually odd)
- Voting mechanism
- Memory indexing

Euclidean Distance
- Typically used for real-valued attributes.
- An instance x (often called a feature vector) is described by its attribute values

  \langle a_1(x), a_2(x), \ldots, a_n(x) \rangle

- The distance between two instances x_i and x_j is

  d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \big(a_r(x_i) - a_r(x_j)\big)^2}
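A small Python sketch of this distance (the function name is my own; the sample vectors are record 1 and the test record from the worked example later in these slides):

```python
import math

def euclidean_distance(xi, xj):
    """d(xi, xj) = sqrt(sum_r (a_r(xi) - a_r(xj))^2) for feature vectors xi, xj."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance([6, 1, 10, 4], [7, 2, 9, 4]))  # 1.732...
```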

Discrete Valued Target Function


Training algorithm:
- For each training example \langle x, f(x) \rangle, add the example to the list training_examples.

Classification algorithm:
- Given a query instance x_q to be classified:
  - Let x_1, \ldots, x_k be the k training examples nearest to x_q.
  - Return

    \hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))

    where \delta(a, b) = 1 if a = b, and \delta(a, b) = 0 otherwise.
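A minimal Python sketch of the classification algorithm above (function and variable names are my own, and a plain Euclidean distance is assumed; the example instances at the bottom are hypothetical):

```python
from collections import Counter
import math

def euclidean(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training_examples, x_q, k):
    """Return the most common class among the k training examples nearest to x_q.

    training_examples is a list of (x, f(x)) pairs, mirroring the training
    algorithm above, which simply stores every example.
    """
    neighbors = sorted(training_examples, key=lambda ex: euclidean(ex[0], x_q))[:k]
    votes = Counter(label for _, label in neighbors)  # sum of delta(v, f(x_i)) per class v
    return votes.most_common(1)[0][0]                 # argmax over class values v

# Hypothetical usage:
examples = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B")]
print(knn_classify(examples, [1.1, 1.0], k=3))  # "A" (two A votes vs one B)
```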

Continuous valued target function


- The algorithm computes the mean value of the k nearest training examples rather than the most common value.
- Replace the final line of the previous algorithm with

  \hat{f}(x_q) \leftarrow \frac{1}{k} \sum_{i=1}^{k} f(x_i)
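A minimal sketch of this continuous-valued variant, again assuming Euclidean distance; the one-dimensional data at the bottom is made up purely to show a call:

```python
import math

def knn_regress(training_examples, x_q, k):
    """Estimate f(x_q) as the mean target value of the k nearest training examples."""
    dist = lambda xi: math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x_q)))
    neighbors = sorted(training_examples, key=lambda ex: dist(ex[0]))[:k]
    return sum(target for _, target in neighbors) / k

# Hypothetical one-dimensional data where f(x) is roughly 2x.
data = [([1.0], 2.1), ([2.0], 3.9), ([3.0], 6.2), ([4.0], 8.0)]
print(knn_regress(data, [2.5], k=2))  # mean of 3.9 and 6.2 -> 5.05
```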

Using the training data, classify the test data for 1-NN, 3-NN, 5-NN, and 7-NN.

Training data
Number  Lines  Line types  Rectangles  Colors  Mondrian?
1       6      1           10          4       No
2       4      2           8           5       No
3       5      2           7           4       Yes
4       5      1           8           4       Yes
5       5      1           10          5       No
6       6      1           8           6       Yes
7       7      1           14          5       No

Test data
Number  Lines  Line types  Rectangles  Colors  Mondrian?
8       7      2           9           4       ?

The distance of the 1st training record from the test record is

d_1 = \sqrt{(6-7)^2 + (1-2)^2 + (10-9)^2 + (4-4)^2} = \sqrt{3} = 1.732

Distances of all training records from the test record:
Number  Mondrian?  Distance from test data
1       No         1.732
2       No         3.317
3       Yes        2.828
4       Yes        2.450
5       No         2.646
6       Yes        2.646
7       No         5.196

For 1-NN, the 1st record (class = No) is the closest, i.e. the 1st neighbor, hence the test record is classified as No.
For 3-NN, the neighbors are the 1st (No), the 4th (Yes), and one of the 5th (No) and 6th (Yes), which are at equal distance. A tie occurs for the third neighbor; to break it, use some mechanism such as choosing randomly or giving priority to the earlier record. Taking the 5th record gives votes No, Yes, No, so the test record is classified as No.
For 5-NN, the neighbors in order are the 1st (No), 4th (Yes), 5th (No), 6th (Yes), and 3rd (Yes), giving three Yes votes against two No votes, hence the test record is classified as Yes.
For 7-NN, only 7 records exist in the training set, so all are considered; there are 4 No and 3 Yes classes, hence the test record is classified as No.
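The following short script (mine, not part of the slides) reproduces the distance table and the 1-NN, 3-NN, 5-NN, and 7-NN classifications; Python's stable sort implements the "priority to the earlier record" tie-break between records 5 and 6:

```python
import math
from collections import Counter

# Training data from the slides: (Lines, Line types, Rectangles, Colors) -> Mondrian?
training = [
    ((6, 1, 10, 4), "No"),
    ((4, 2, 8, 5), "No"),
    ((5, 2, 7, 4), "Yes"),
    ((5, 1, 8, 4), "Yes"),
    ((5, 1, 10, 5), "No"),
    ((6, 1, 8, 6), "Yes"),
    ((7, 1, 14, 5), "No"),
]
test = (7, 2, 9, 4)  # record 8

def euclidean(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

# Stable sort keeps record 5 ahead of record 6 when their distances tie.
ranked = sorted(training, key=lambda ex: euclidean(ex[0], test))

for k in (1, 3, 5, 7):
    votes = Counter(label for _, label in ranked[:k])
    print(k, votes.most_common(1)[0][0])  # 1 No, 3 No, 5 Yes, 7 No
```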

Algorithm questions
- What is the space complexity for the model?
- What is the time complexity for learning the model?
- What is the time complexity for classification of an instance?

Analysis of KNN Algorithm


Advantages of the kNN algorithm:
- kNN is an easy-to-understand and easy-to-implement classification technique, and it can perform well in many situations.
- Cover and Hart showed that the error of the nearest neighbor rule is bounded above by twice the Bayes error under certain reasonable assumptions. Moreover, the error of the general kNN method asymptotically approaches the Bayes error and can be used to approximate it.
- kNN is particularly well suited for multi-modal classes, as well as for applications in which an object can have many class labels.

Disadvantages of the kNN algorithm:
- The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially as the size of the training set grows.

Case-Based Reasoning (CBR)

CBR: Uses a database of problem solutions to solve new problems


- Stores symbolic descriptions (tuples or cases), not points in a Euclidean space.
- Applications: customer service (product-related diagnosis), legal rulings.
- Methodology:
  - Instances are represented by rich symbolic descriptions (e.g., function graphs).
  - Search for similar cases; multiple retrieved cases may be combined.
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
- Challenges:
  - Finding a good similarity metric.
  - Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases.
