
Lecture No.

Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University

Date: 12.3.2008
Today's Agenda

Instance-Based Learning
Measuring the Credibility of a Classifier
Discussion of the Case Study
WEKA Demo
Instance-Based Learning

In instance-based learning algorithms, learning consists simply of storing the presented training data. When a new query instance is to be classified, a set of similar instances is retrieved from memory (the training set) and used to classify it.
Instance-Based Learning

(Diagram: the training instances are simply kept in memory; the query instance is compared against them directly to produce its target class.)
Previous Learning Method

(Diagram: the training instances are first used to build a model / hypothesis, which is then applied to the query instance to produce the target class.)
Comparison
                    Instance-based learning              Previous learning methods

In memory           Training instances                   Model / Hypothesis

Hypothesis          A new hypothesis is generated        The hypothesis is the same
                    for every query instance             for all future examples
K-nearest Neighbor Learning

The most basic instance-based method.

This algorithm assumes all instances correspond to points in the n-dimensional space $\mathbb{R}^n$. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance.

Other metrics, such as Mahalanobis, rank-based, or correlation-based distances, can also be used.
Euclidean Distance

The Euclidean distance between points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), in Euclidean n-space, is defined as:

$\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

One dimension: $\sqrt{(p_1 - q_1)^2} = |p_1 - q_1|$

Two dimensions: $\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$
Manhattan Distance

The taxicab metric is also known as rectilinear distance, L1 distance, city block distance, or Manhattan distance, with corresponding variations in the name of the geometry.

The Manhattan distance between points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n), in Euclidean n-space, is defined as:

$|p_1 - q_1| + |p_2 - q_2| + \cdots + |p_n - q_n|$
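As a minimal illustration (not part of the original slides), both metrics can be computed directly in Python with NumPy; the example points p and q are arbitrary:

import numpy as np

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # sum of absolute coordinate differences (L1 / city block)
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

p, q = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(p, q))   # 3.605...
print(manhattan(p, q))   # 5.0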
Manhattan Distance

(Figure) Manhattan distance versus Euclidean distance: the red, blue, and yellow lines all have the same length (12) in both Euclidean and taxicab geometry. In Euclidean geometry, the green line has length 6√2 ≈ 8.48 and is the unique shortest path. In taxicab geometry, the green line's length is still 12, making it no shorter than any other path shown.
Mahalanobis Distance
Prasanta Chandra Mahalanobis (June 29, 1893 – June 28, 1972) was an Indian scientist and applied statistician.

His most important contributions are related to large-scale sample surveys. He introduced the concept of pilot surveys and advocated the usefulness of sampling methods.

He is best known for the Mahalanobis distance, a statistical measure.

He founded the Indian Statistical Institute.

On his birthday, 29 June, we celebrate National Statistics Day.

P.C. Mahalanobis
Mahalanobis Distance

It is based on correlations between variables by which different


patterns can be identified and analyzed. It is a useful way of
determining similarity of an unknown sample set to a known one.

It differs from Euclidean distance in that it takes into account the


correlations of the data set and is scale-invariant, i.e. not dependent on
the scale of measurements.
Mahalanobis Distance

The Mahalanobis distance of a multivariate vector $X = (x_1, x_2, \ldots, x_n)$ from a group of values with mean

$\mu = (\mu_1, \mu_2, \ldots, \mu_n)$

and covariance matrix $\Sigma$ is defined as:

$D_M(X) = \sqrt{(X - \mu)^T \, \Sigma^{-1} \, (X - \mu)}$
Covariance Matrix

$\Sigma = \begin{bmatrix}
E[(x_1-\mu_1)(x_1-\mu_1)] & E[(x_1-\mu_1)(x_2-\mu_2)] & \cdots & E[(x_1-\mu_1)(x_n-\mu_n)] \\
E[(x_2-\mu_2)(x_1-\mu_1)] & E[(x_2-\mu_2)(x_2-\mu_2)] & \cdots & E[(x_2-\mu_2)(x_n-\mu_n)] \\
\vdots & \vdots & \ddots & \vdots \\
E[(x_n-\mu_n)(x_1-\mu_1)] & E[(x_n-\mu_n)(x_2-\mu_2)] & \cdots & E[(x_n-\mu_n)(x_n-\mu_n)]
\end{bmatrix}$

$\Sigma_{i,j} = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]$
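As an illustrative sketch (not from the slides), the Mahalanobis distance of a point from the mean of a small sample can be computed with NumPy; the data matrix and query point below are arbitrary:

import numpy as np

data = np.array([[2.0, 3.0], [2.5, 4.5], [3.0, 2.5], [3.5, 5.0], [4.0, 3.5]])
x = np.array([3.2, 2.0])

mu = data.mean(axis=0)                    # component-wise mean of the sample
sigma = np.cov(data, rowvar=False)        # sample covariance matrix
diff = x - mu
d_m = np.sqrt(diff @ np.linalg.inv(sigma) @ diff)   # sqrt((x - mu)^T Sigma^-1 (x - mu))
print(d_m)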

K-nearest Neighbor Learning
In nearest-neighbor learning the target function may be either discrete-valued or real-valued.

Let us first consider learning discrete-valued target functions of the form $f : \mathbb{R}^n \to V$, where V is the finite set $\{v_1, \ldots, v_s\}$.
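For reference (this equation is not shown in the extracted slide; it is the standard k-NN rule as given in Mitchell's textbook treatment), the query is assigned the most common value among its k nearest training examples:

$\hat{f}(x_q) \leftarrow \underset{v \in V}{\arg\max} \sum_{i=1}^{k} \delta(v, f(x_i)), \qquad \text{where } \delta(a, b) = 1 \text{ if } a = b \text{ and } 0 \text{ otherwise}$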
1-nearest Neighbor Learning

(Figure: with k = 1, the query instance x_q is classified as positive.)
5-nearest Neighbor Learning

(Figure: with k = 5, the query instance x_q is classified as negative.)
K-nearest Neighbor Learning
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating continuous-valued target functions: the value returned for the query is simply the mean of the f values of its k nearest training examples.
Distance-Weighted Nearest
Neighbor Learning
One refinement of the k-NEAREST NEIGHBOR algorithm is to weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors.
Distance-Weighted Nearest
Neighbor Learning

For a discrete-valued target function, each neighbor's vote is weighted by the inverse square of its distance to the query:

$\hat{f}(x_q) \leftarrow \underset{v \in V}{\arg\max} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i)), \qquad w_i \equiv \frac{1}{d(x_q, x_i)^2}$

For a continuous-valued target function, the prediction is the distance-weighted average of the neighbors' values:

$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
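A minimal sketch of the discrete, distance-weighted case in Python, assuming a small in-memory training set (the arrays and labels below are illustrative, not from the lecture):

import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_q, k=3):
    # distances from the query to every stored training instance
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if dists[i] == 0.0:                         # exact match: return its label directly
            return y_train[i]
        votes[y_train[i]] += 1.0 / dists[i] ** 2    # w_i = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = ['pos', 'pos', 'neg', 'neg']
print(weighted_knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # 'pos'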
Distance-Weighted Nearest
Neighbor Learning
The variants of the k-NEAREST NEIGHBOR algorithm discussed so far consider only the k nearest neighbors when classifying the query point. Once we add distance weighting, there is really no harm in allowing all training examples to influence the classification of x_q, because very distant examples will have very little effect on f(x_q).

The only disadvantage of considering all examples is that the classifier will run more slowly. If all training examples are considered when classifying a new query instance, we call the algorithm a global method; if only the nearest training examples are considered, we call it a local method.
Remarks on k-NEAREST
NEIGHBOR Algorithm

k-NN does not perform explicit generalization: a new hypothesis is effectively formed for every new query instance.

Another issue is the selection of the parameter k. One way to find an optimal k is to perform cross-validation for various values of k and then select the one that gives the best performance.
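A small sketch of this selection procedure using scikit-learn (the library, the Iris dataset and the 5-fold setting are my choices for illustration; the lecture's own demo uses WEKA):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_acc = None, 0.0
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=5).mean()   # 5-fold cross-validated accuracy
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, round(best_acc, 3))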
Remarks on k-NEAREST
NEIGHBOR Algorithm

One practical issue in applying k-NEAREST NEIGHBOR algorithms is that the distance between instances is calculated based on all attributes of the instance (i.e., on all axes in the Euclidean space containing the instances). This is in contrast to methods such as decision tree learning systems, which select only a subset of the instance attributes when forming the hypothesis.
(Figure: in a decision tree, only a subset of the attributes is tested.)
Remarks on k-NEAREST
NEIGHBOR Algorithm

Consider applying k-NEAREST NEIGHBOR to a problem in which each instance is described by 20 attributes, but only 2 of these attributes are relevant to determining the classification for the particular target function.

In this case, instances that have identical values for the 2 relevant attributes may nevertheless be distant from one another in the 20-dimensional instance space. As a result, the similarity metric used by k-NEAREST NEIGHBOR, which depends on all 20 attributes, will be misleading.

The distance between neighbors will be dominated by the large number of irrelevant attributes. This difficulty, which arises when many irrelevant attributes are present, is sometimes referred to as the curse of dimensionality. Nearest-neighbor approaches are especially sensitive to this problem.
Remarks on k-NEAREST
NEIGHBOR Algorithm

One interesting approach to overcoming this problem is to weight each attribute differently when calculating the distance between two instances. This corresponds to stretching the axes in the Euclidean space: shortening the axes that correspond to less relevant attributes, and lengthening the axes that correspond to more relevant attributes. The amount by which each axis should be stretched can be determined automatically using a cross-validation approach.

An even more drastic alternative is to completely eliminate the least relevant attributes from the instance space. This is equivalent to setting some of the scaling factors z_i to zero. [Feature selection]
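A minimal sketch of the axis-stretching idea: a per-attribute weight vector (chosen by hand here purely for illustration; in practice it would come from cross-validation) scales each dimension before the Euclidean distance is computed, and a zero weight removes the attribute entirely:

import numpy as np

def weighted_euclidean(p, q, z):
    # z[i] stretches (z[i] > 1), shrinks (0 < z[i] < 1) or removes (z[i] == 0) axis i
    diff = z * (np.asarray(p) - np.asarray(q))
    return np.sqrt(np.sum(diff ** 2))

p = np.array([1.0, 10.0, 3.0])
q = np.array([2.0, 90.0, 3.5])
z = np.array([1.0, 0.0, 2.0])   # second attribute judged irrelevant, third emphasized
print(weighted_euclidean(p, q, z))   # sqrt(1 + 0 + 1) = 1.414...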
Remarks on k-NEAREST
NEIGHBOR Algorithm

k-NN is slow for a large training dataset, because the entire dataset must be searched to make each decision. To avoid this, efficient memory indexing must be used.

Various methods have been developed for indexing the stored training examples so that the nearest neighbors can be identified more efficiently, at some additional cost in memory. One such indexing method is the kd-tree (Bentley 1975; Friedman et al. 1977), in which instances are stored at the leaves of a tree, with nearby instances stored at the same or nearby nodes. The internal nodes of the tree sort the new query x_q to the relevant leaf by testing selected attributes of x_q.
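A short sketch of kd-tree indexing with SciPy's cKDTree (an illustrative choice of library; the random data stands in for a stored training set):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 5))        # 10,000 stored training instances in R^5

tree = cKDTree(X_train)                 # build the index once
x_q = rng.random(5)
dists, idx = tree.query(x_q, k=3)       # distances and indices of the 3 nearest instances
print(idx, dists)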
Measuring Credibility
Confusion Matrix

                              Predicted Class
                              Class1                    Class2
Actual    Class1              True Positive (TP)        False Negative (FN)
Class     Class2              False Positive (FP)       True Negative (TN)

ACC = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
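A tiny sketch computing these measures from the four counts (the counts below are made-up example values):

def credibility(tp, fn, fp, tn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall for Class1
    specificity = tn / (tn + fp)
    return acc, sensitivity, specificity

print(credibility(tp=40, fn=10, fp=5, tn=45))   # (0.85, 0.8, 0.9)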
Recall/Precision
Cost Sensitive Cases
Other Important Measures
Measuring Performance

Hold out strategy


Cross-validation
LOOCV (Leave one out cross-validation)
Bootstrap Method
Hold-out Strategy

(Diagram: the complete dataset is split into a training dataset and a separate test dataset.)
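A minimal sketch of a hold-out split with scikit-learn (the library choice and the 70/30 split ratio are mine, purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))        # accuracy on the held-out test dataset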
4-Fold Cross-validation

(Diagram: the dataset is partitioned into four equal folds. In each of the four rounds, one fold serves as the test dataset and the remaining three folds form the training dataset, giving accuracies ACC1, ACC2, ACC3 and ACC4.)

The overall accuracy is the average over the four rounds:

ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4
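A minimal sketch of this procedure in Python, assuming scikit-learn's KFold splitter and the Iris dataset purely for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
accs = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    accs.append(knn.score(X[test_idx], y[test_idx]))   # ACC1 .. ACC4

print(np.mean(accs))    # ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4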


LOOCV (Leave-one-out cross-validation)

LOOCV is simply n-fold cross-validation, where n is the number of examples in the dataset.

(Diagram: for a dataset of n = 20 examples, each round holds out a single example as the test dataset and trains on the remaining 19; this is repeated 20 times, once per example.)
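The same idea expressed with scikit-learn's LeaveOneOut splitter (again an illustrative choice of library and data):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
hits = []
for train_idx, test_idx in LeaveOneOut().split(X):      # n rounds, one test example each
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    hits.append(knn.score(X[test_idx], y[test_idx]))    # 1.0 if that example is classified correctly

print(np.mean(hits))    # LOOCV accuracy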
Bootstrap Method

The training dataset is formed by sampling the complete dataset with replacement.

(Diagram: the complete dataset of size n is sampled with replacement to give a training dataset, also of size n; the examples that are never selected form the test dataset, of size n1.)
Bootstrap Method

Since the training dataset is sampled with replacement, the probability that a particular example is never selected for the training dataset in any of the n draws is

$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$

so about 63.2% of the distinct examples end up in the training dataset and about 36.8% end up in the test dataset.
Error rate (0.632 bootstrap estimate): $\text{err} = 0.632 \cdot e_{\text{test}} + 0.368 \cdot e_{\text{training}}$
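A minimal sketch of bootstrap sampling with NumPy, showing that roughly 63.2% of the distinct examples land in the training dataset (the dataset size is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)                                   # the complete dataset
train_idx = rng.choice(indices, size=n, replace=True)    # sample n examples with replacement
test_idx = np.setdiff1d(indices, train_idx)              # examples never selected

print(len(np.unique(train_idx)) / n)   # ~0.632 of distinct examples appear in training
print(len(test_idx) / n)               # ~0.368 of examples form the test dataset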
