
Lecture No.

Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University

Date: 12.3.2008
Today's Agenda

Instance-Based Learning
Measuring the Credibility of a Classifier
Discussion of the Case Study
WEKA Demo
Instance-Based Learning

In instance-based learning algorithms, learning consists simply of storing the presented training data. When a new query instance is to be classified, a set of similar instances is retrieved from memory (the training set) and used to classify it.
Instance-Based Learning

(Diagram: the training instances are simply kept in memory; the query instance is compared against them directly to produce its target class.)
Previous Learning Method

(Diagram: the training instances are first used to build a model / hypothesis, which is then applied to the query instance to produce the target class.)
Comparison
                    Instance-based learning              Previous learning methods

In memory           Training instances                   Model / Hypothesis

Hypothesis          A new hypothesis is generated        The hypothesis is the same
                    for every query instance             for all future examples
K-nearest Neighbor Learning

The most basic instance-based method.

This algorithm assumes all instances correspond to points in the n-dimensional space $\mathbb{R}^n$. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance.

Other metrics, such as Mahalanobis, rank-based, or correlation-based distances, can also be used.
Euclidean Distance

The Euclidean distance between points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), in Euclidean n-space, is defined as:

$\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

One dimension: $\sqrt{(p_1 - q_1)^2} = |p_1 - q_1|$

Two dimensions: $\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$
Manhattan Distance

The taxicab metric is also known as rectilinear distance, L1 distance, city block distance, or Manhattan distance, with corresponding variations in the name of the geometry.

The Manhattan distance between points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), in Euclidean n-space, is defined as:

$|p_1 - q_1| + |p_2 - q_2| + \cdots + |p_n - q_n|$
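As a minimal illustration (not part of the original slides), both metrics can be computed directly in Python with NumPy; the example points p and q are arbitrary:

import numpy as np

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # sum of absolute coordinate differences (L1 / city block)
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

p, q = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(p, q))   # 3.605...
print(manhattan(p, q))   # 5.0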
Manhattan Distance

(Figure) Manhattan distance versus Euclidean distance: the red, blue, and yellow lines all have the same length (12) in both Euclidean and taxicab geometry. In Euclidean geometry, the green line has length 6√2 ≈ 8.48 and is the unique shortest path. In taxicab geometry, the green line's length is still 12, making it no shorter than any other path shown.
Mahalanobis Distance
Prasanta Chandra Mahalanobis (June 29, 1893 – June 28, 1972) was an Indian scientist and applied statistician.

His most important contributions are related to large-scale sample surveys. He introduced the concept of pilot surveys and advocated the usefulness of sampling methods.

He is best known for the Mahalanobis distance, a statistical measure.

He founded the Indian Statistical Institute.

On his birthday, 29 June, we celebrate National Statistics Day.

P.C. Mahalanobis
Mahalanobis Distance

It is based on correlations between variables by which different


patterns can be identified and analyzed. It is a useful way of
determining similarity of an unknown sample set to a known one.

It differs from Euclidean distance in that it takes into account the


correlations of the data set and is scale-invariant, i.e. not dependent on
the scale of measurements.
Mahalanobis Distance

The Mahalanobis distance of a multivariate vector $X = (x_1, x_2, \ldots, x_n)$ from a group of values with mean

$\mu = (\mu_1, \mu_2, \ldots, \mu_n)$

and covariance matrix $\Sigma$ is defined as:

$D_M(X) = \sqrt{(X - \mu)^T \, \Sigma^{-1} \, (X - \mu)}$
Covariance Matrix

$\Sigma = \begin{bmatrix}
E[(x_1-\mu_1)(x_1-\mu_1)] & E[(x_1-\mu_1)(x_2-\mu_2)] & \cdots & E[(x_1-\mu_1)(x_n-\mu_n)] \\
E[(x_2-\mu_2)(x_1-\mu_1)] & E[(x_2-\mu_2)(x_2-\mu_2)] & \cdots & E[(x_2-\mu_2)(x_n-\mu_n)] \\
\vdots & \vdots & \ddots & \vdots \\
E[(x_n-\mu_n)(x_1-\mu_1)] & E[(x_n-\mu_n)(x_2-\mu_2)] & \cdots & E[(x_n-\mu_n)(x_n-\mu_n)]
\end{bmatrix}$

$\Sigma_{i,j} = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]$
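As an illustrative sketch (not from the slides), the Mahalanobis distance of a point from the mean of a small sample can be computed with NumPy; the data matrix and query point below are arbitrary:

import numpy as np

data = np.array([[2.0, 3.0], [2.5, 4.5], [3.0, 2.5], [3.5, 5.0], [4.0, 3.5]])
x = np.array([3.2, 2.0])

mu = data.mean(axis=0)                    # component-wise mean of the sample
sigma = np.cov(data, rowvar=False)        # sample covariance matrix
diff = x - mu
d_m = np.sqrt(diff @ np.linalg.inv(sigma) @ diff)   # sqrt((x - mu)^T Sigma^-1 (x - mu))
print(d_m)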

K-nearest Neighbor Learning
In nearest-neighbor learning the target function may be either discrete-valued or real-valued.

Let us first consider learning discrete-valued target functions of the form $f : \mathbb{R}^n \to V$, where V is the finite set $\{v_1, \ldots, v_s\}$.
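For reference (this equation is not shown in the extracted slide; it is the standard k-NN rule as given in Mitchell's textbook treatment), the query is assigned the most common value among its k nearest training examples:

$\hat{f}(x_q) \leftarrow \underset{v \in V}{\arg\max} \sum_{i=1}^{k} \delta(v, f(x_i)), \qquad \text{where } \delta(a, b) = 1 \text{ if } a = b \text{ and } 0 \text{ otherwise}$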
1-nearest Neighbor Learning

(Figure: with k = 1, the query instance x_q is classified as positive.)
5-nearest Neighbor Learning

(Figure: with k = 5, the query instance x_q is classified as negative.)
K-nearest Neighbor Learning
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating continuous-valued target functions: the value returned for the query is simply the mean of the f values of its k nearest training examples.
Distance-Weighted Nearest
Neighbor Learning
One refinement of the k-NEAREST NEIGHBOR algorithm is to weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors.
Distance-Weighted Nearest
Neighbor Learning

For a discrete-valued target function, each neighbor's vote is weighted by the inverse square of its distance to the query:

$\hat{f}(x_q) \leftarrow \underset{v \in V}{\arg\max} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)), \qquad w_i \equiv \frac{1}{d(x_q, x_i)^2}$

For a continuous-valued target function, the prediction is the distance-weighted average of the neighbors' values:

$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
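A minimal sketch of the discrete, distance-weighted case in Python, assuming a small in-memory training set (the arrays and labels below are illustrative, not from the lecture):

import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_q, k=3):
    # distances from the query to every stored training instance
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if dists[i] == 0.0:                         # exact match: return its label directly
            return y_train[i]
        votes[y_train[i]] += 1.0 / dists[i] ** 2    # w_i = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = ['pos', 'pos', 'neg', 'neg']
print(weighted_knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # 'pos'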
Distance-Weighted Nearest
Neighbor Learning
The variants of the k-NEAREST NEIGHBOR algorithm discussed so far consider only the k nearest neighbors when classifying the query point. Once we add distance weighting, there is really no harm in allowing all training examples to influence the classification of x_q, because very distant examples will have very little effect on f(x_q).

The only disadvantage of considering all examples is that the classifier will run more slowly. If all training examples are considered when classifying a new query instance, we call the algorithm a global method; if only the nearest training examples are considered, we call it a local method.
Remarks on k-NEAREST
NEIGHBOR Algorithm

k-NN does not perform explicit generalization: a new hypothesis is effectively formed for every new query instance.

Another issue is the selection of the parameter k. One way to find an optimal k is to perform cross-validation for various values of k and then select the one that gives the best performance.
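A small sketch of this selection procedure using scikit-learn (the library, the Iris dataset and the 5-fold setting are my choices for illustration; the lecture's own demo uses WEKA):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_acc = None, 0.0
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, round(best_acc, 3))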
Remarks on k-NEAREST
NEIGHBOR Algorithm

One practical issue in applying k-NEAREST NEIGHBOR algorithms is that the distance between instances is calculated based on all attributes of the instance (i.e., on all axes in the Euclidean space containing the instances). This is in contrast to methods such as decision tree learning systems, which select only a subset of the instance attributes when forming the hypothesis.
(Figure: in a decision tree, only a subset of the attributes is tested.)
Remarks on k-NEAREST
NEIGHBOR Algorithm

Consider applying k-NEAREST NEIGHBOR to a problem in which each instance is described by 20 attributes, but only 2 of these attributes are relevant to determining the classification for the particular target function.

In this case, instances that have identical values for the 2 relevant attributes may nevertheless be distant from one another in the 20-dimensional instance space. As a result, the similarity metric used by k-NEAREST NEIGHBOR, which depends on all 20 attributes, will be misleading.

The distance between neighbors will be dominated by the large number of irrelevant attributes. This difficulty, which arises when many irrelevant attributes are present, is sometimes referred to as the curse of dimensionality. Nearest-neighbor approaches are especially sensitive to this problem.
Remarks on k-NEAREST
NEIGHBOR Algorithm

One interesting approach to overcoming this problem is to weight each attribute differently when calculating the distance between two instances. This corresponds to stretching the axes in the Euclidean space: shortening the axes that correspond to less relevant attributes, and lengthening the axes that correspond to more relevant attributes. The amount by which each axis should be stretched can be determined automatically using a cross-validation approach.

An even more drastic alternative is to completely eliminate the least relevant attributes from the instance space. This is equivalent to setting some of the scaling factors z_i to zero. [Feature selection]
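A minimal sketch of the axis-stretching idea: a per-attribute weight vector (chosen by hand here purely for illustration; in practice it would come from cross-validation) scales each dimension before the Euclidean distance is computed, and a zero weight removes the attribute entirely:

import numpy as np

def weighted_euclidean(p, q, z):
    # z[i] stretches (z[i] > 1), shrinks (0 < z[i] < 1) or removes (z[i] == 0) axis i
    diff = z * (np.asarray(p) - np.asarray(q))
    return np.sqrt(np.sum(diff ** 2))

p = np.array([1.0, 10.0, 3.0])
q = np.array([2.0, 90.0, 3.5])
z = np.array([1.0, 0.0, 2.0])   # second attribute judged irrelevant, third emphasized
print(weighted_euclidean(p, q, z))   # sqrt(1 + 0 + 1) = 1.414...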
Remarks on k-NEAREST
NEIGHBOR Algorithm

k-NN is slow for a large training dataset, because the entire dataset must be searched to make each decision. To avoid this, efficient memory indexing must be used.

Various methods have been developed for indexing the stored training examples so that the nearest neighbors can be identified more efficiently, at some additional cost in memory. One such indexing method is the kd-tree (Bentley 1975; Friedman et al. 1977), in which instances are stored at the leaves of a tree, with nearby instances stored at the same or nearby nodes. The internal nodes of the tree sort the new query x_q to the relevant leaf by testing selected attributes of x_q.
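A short sketch of kd-tree indexing with SciPy's cKDTree (an illustrative choice of library; the random data stands in for a stored training set):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 5))        # 10,000 stored training instances in R^5

tree = cKDTree(X_train)                 # build the index once
x_q = rng.random(5)
dists, idx = tree.query(x_q, k=3)       # distances and indices of the 3 nearest instances
print(idx, dists)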
Measuring Credibility
Confusion Matrix

                              Predicted Class
                              Class1                    Class2
Actual    Class1              True Positive (TP)        False Negative (FN)
Class     Class2              False Positive (FP)       True Negative (TN)

ACC = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
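A tiny sketch computing these measures from the four counts (the counts below are made-up example values):

def credibility(tp, fn, fp, tn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall for Class1
    specificity = tn / (tn + fp)
    return acc, sensitivity, specificity

print(credibility(tp=40, fn=10, fp=5, tn=45))   # (0.85, 0.8, 0.9)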
Recall/Precision
Cost Sensitive Cases
Other Important Measures
Measuring Performance

Hold out strategy


Cross-validation
LOOCV (Leave one out cross-validation)
Bootstrap Method
Hold-out Strategy

(Diagram: the complete dataset is split into a training dataset and a separate test dataset.)
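A minimal sketch of a hold-out split with scikit-learn (the library choice and the 70/30 split ratio are mine, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))        # accuracy on the held-out test dataset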
4-Fold Cross-validation

(Diagram: the dataset is partitioned into four equal folds. In each of the four rounds, one fold serves as the test dataset and the remaining three folds form the training dataset, giving accuracies ACC1, ACC2, ACC3 and ACC4.)

The overall accuracy is the average over the four rounds:

ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4
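A minimal sketch of this procedure in Python, assuming scikit-learn's KFold splitter and the Iris dataset purely for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
accs = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    accs.append(knn.score(X[test_idx], y[test_idx]))   # ACC1 .. ACC4

print(np.mean(accs))    # ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4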


LOOCV (Leave-one-out cross-validation)

LOOCV is simply n-fold cross-validation, where n is the number of examples in the dataset.

(Diagram: for a dataset of n = 20 examples, each round holds out a single example as the test dataset and trains on the remaining 19; this is repeated 20 times, once per example.)
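The same idea expressed with scikit-learn's LeaveOneOut splitter (again an illustrative choice of library and data):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
hits = []
for train_idx, test_idx in LeaveOneOut().split(X):      # n rounds, one test example each
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    hits.append(knn.score(X[test_idx], y[test_idx]))    # 1.0 if that example is classified correctly

print(np.mean(hits))    # LOOCV accuracy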
Bootstrap Method

The training dataset is formed by sampling the complete dataset with replacement.

(Diagram: the complete dataset of size n is sampled with replacement to give a training dataset, also of size n; the examples that are never selected form the test dataset, of size n1.)
Bootstrap Method

Since the training dataset is sampled with replacement, the probability that a particular example is never selected for the training dataset in any of the n draws is

$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$

so about 63.2% of the distinct examples end up in the training dataset and about 36.8% end up in the test dataset.
Error rate (0.632 bootstrap estimate): $\text{err} = 0.632 \cdot e_{\text{test}} + 0.368 \cdot e_{\text{training}}$
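A minimal sketch of bootstrap sampling with NumPy, showing that roughly 63.2% of the distinct examples land in the training dataset (the dataset size is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)                                   # the complete dataset
train_idx = rng.choice(indices, size=n, replace=True)    # sample n examples with replacement
test_idx = np.setdiff1d(indices, train_idx)              # examples never selected

print(len(np.unique(train_idx)) / n)   # ~0.632 of distinct examples appear in training
print(len(test_idx) / n)               # ~0.368 of examples form the test dataset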
