
Pattern Recognition and Neural Networks

P.S. Sastry
sastry@ee.iisc.ernet.in

Dept. of Electrical Engineering
Indian Institute of Science, Bangalore

Reference Books

R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley, 2002.

C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.

Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.

Pattern Recognition

A basic attribute of people: categorisation of sensory input.

Pattern → PR System → Class label

Examples of Pattern Recognition tasks:

Reading facial expressions
Recognising speech
Reading a document
Identifying a person by fingerprints
Diagnosis from medical images
Wine tasting

Machine Recognition of Patterns

Pattern → feature extractor → X → classifier → class label

The feature extractor makes some measurements on the input pattern.

X is called the Feature Vector. Often, X ∈ R^n.

The classifier maps each feature vector to a class label.

The features to be used are problem-specific.

Some Examples of PR Tasks

Character Recognition
Pattern: an image.
Class: the identity of the character.
Features: binary image, projections (e.g., row and column sums), moments, etc.

Examples Contd.

Speech Recognition
Pattern: a 1-D signal (or its sampled version).
Class: the identity of speech units.
Features: LPC models of chunks of speech, spectral information, cepstrum, etc.
The pattern can become a sequence of feature vectors.

Examples contd...

Document Classification (e.g., spam detection)
Pattern: a document.
Class: the type of document.
Features: word occurrence counts, word context, etc.

Many other applications:
Biometrics-based authentication,
Video Surveillance,
Credit Screening,
Imposter Detection, diagnostics of machinery, etc.

Design of Pattern Recognition Systems

Features depend on the problem. Measure relevant quantities.

Should we design features or learn them?

Some techniques are available to extract more relevant quantities from the initial measurements (e.g., PCA).

After feature extraction, each pattern is a vector.

The classifier is a function that maps such vectors into class labels.

Many general techniques of classifier design are available.

This course is about techniques for obtaining classifiers.

Some notation

Feature Space, X: the set of all possible feature vectors.

Classifier: a decision rule or a function, h : X → {1, . . . , M}.

Often, X = R^n. It is convenient to take M = 2; then we take the labels as {0, 1} or {−1, +1}.

Then any binary-valued function on X is a classifier.

Which h should we choose? We want a correct or optimal classifier.
How do we design classifiers?
How do we judge performance?
How do we provide performance guarantees?

We first consider the 2-class problem.

We can handle the M > 2 case if we know how to handle the 2-class problem.

Simplest alternative: design M 2-class classifiers (One vs Rest).

There are other possibilities: e.g., tree-structured classifiers.

The 2-class problem is the basic problem.

We will also look at M-class classifiers.

A simple PR problem

Problem: Spot the Right Candidate

Features:
x1: marks based on academic record
x2: marks in the interview

A Classifier: ax1 + bx2 > c ⇒ Good

Another Classifier: x1 x2 > c ⇒ Good
(or (x1 + a)(x2 + b) > c).

Design of the classifier:
We have to choose a specific form for the classifier.
What values should we use for parameters such as a, b, c?
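A toy sketch of the two classifier forms above; the parameter values a, b, c and the marks are made up, only to emphasise that both the form and the parameter values have to be chosen:

```python
# A toy sketch of the two classifier forms above; the parameter values a, b, c
# are illustrative assumptions, not values suggested by the slides.
def linear_classifier(x1, x2, a=0.6, b=0.4, c=50.0):
    return "Good" if a * x1 + b * x2 > c else "Not good"

def product_classifier(x1, x2, c=2500.0):
    return "Good" if x1 * x2 > c else "Not good"

print(linear_classifier(80, 40), product_classifier(80, 40))
```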

Designing Classifiers

We need to decide how the feature vector values determine the class.
(How do the different marks reflect the goodness of a candidate?)

In most applications it is not possible to design the classifier from the physics of the problem.

The difficulties are:
A lot of variability in the patterns of a single class
Variability in feature vector values
Feature vectors of patterns from different classes can be arbitrarily close
Noise in the measurements

Designing Classifiers contd...

Often the only information available for the design is a training set of example patterns.

Training set: {(Xi, yi), i = 1, . . . , ℓ}.
Here Xi is an example feature vector of class yi (and ℓ is the number of examples).

Generation of the training set: take representative patterns of known category.

Now learn an appropriate function h as the classifier.

Test and validate the classifier on more data.

Training Set
(figure)

Another example problem

Problem: recognize persons of medium build.
Features: height and weight.
The classifier is nonlinear here.

Function Learning

A closely related problem. The output is continuous-valued rather than discrete as in classification.
Here the training set of examples could be {(Xi, yi), i = 1, . . . , ℓ}, Xi ∈ X, yi ∈ R.

The prediction variable, y, is continuous rather than taking finitely many values. (There can be noise in the examples.)

Similar learning techniques are needed to infer the underlying functional relationship between X and y.
(The regression function of y on X.)

Examples of Function Learning

Time series prediction: given a series x1, x2, . . ., find a function to predict xn.

Based on past values: find the best function

xn = h(x_{n-1}, x_{n-2}, . . . , x_{n-p})

Predict stock prices, exchange rates, etc.
The linear prediction model is used in speech analysis.

More general predictors can use other variables also.
Predict rainfall based on measurements and (possibly) previous years' data.
In general, System Identification. (An application: smart sensors.)

Examples contd... : Equaliser

Tx → x(k) → channel → Z(k) → filter → y(k) → Rx

We want y(k) = x(k). Design (or adapt) the filter to achieve this.

We can choose a filter of the form

y(k) = Σ_{i=1}^{T} ai Z(k − i)

Finding the best ai is a function learning problem.

Training set: {(x(k), Z(k)), k = 1, 2, . . . , N}
How do we know x(k) at the receiver end?
Prior agreements (protocols).
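A minimal sketch of how the best taps ai could be found by least squares. The synthetic channel, the noise level, and the choice T = 3 are illustrative assumptions, not part of the slides:

```python
# Sketch: fitting equalizer taps a_i by least squares, assuming a synthetic
# channel and T = 3 taps (both are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 3
x = rng.choice([-1.0, 1.0], size=N)                                 # transmitted symbols
Z = 0.8 * x + 0.4 * np.roll(x, 1) + 0.05 * rng.standard_normal(N)   # channel output

# Design matrix: row k holds [Z(k-1), ..., Z(k-T)]
A = np.array([[Z[k - i] for i in range(1, T + 1)] for k in range(T, N)])
b = x[T:N]                                   # desired outputs, y(k) = x(k)

a, *_ = np.linalg.lstsq(A, b, rcond=None)    # best taps in the least-squares sense
y = A @ a
print("taps:", a, " mean squared error:", np.mean((y - b) ** 2))
```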

Learning from Examples

Both classification and regression involve learning from examples, or learning from data.

Given a training set D = {(Xi, yi), i = 1, 2, . . . , ℓ}, we want to infer a model or function f such that on a new data item, X, we predict y = f(X).

This is the general problem addressed in Machine Learning.

Machine Learning

Machine Learning: a set of methods that can automatically detect regularities in data, in a form that is suitable for prediction or other decision-making scenarios.

We can think of machine learning as a principled approach to fitting models for many different kinds of data.

Hence it is closely related to statistics.

Machine learning encompasses many different data analysis techniques.

Predictive or Supervised Learning:
Classification or regression: given training data, learn to predict an attribute (y) based on the others (X).
Supervised in the sense that we know the correct answer (modulo noise) for the training data.
Variations: ordinal regression, semi-supervised learning, etc.

In this course we will mainly discuss supervised learning techniques.

Reinforcement Learning: learning by doing. For example, how we learn to ride a bicycle.

The feedback we get during training is only evaluative (and noisy).

Useful in many decision-making and control applications.

Unsupervised Learning: analysis of unlabelled data.

Clustering
Frequent Patterns (market basket analysis)
Dimensionality reduction, latent factor analysis
Topic Discovery
Collaborative filtering (imputing missing values)

Learning from examples

Given a training set {(X1, y1), . . . , (Xℓ, yℓ)}, we essentially want to fit a best function y = f(X).

For X ∈ R, this is the familiar curve-fitting problem.

Model choice (the choice of form for f) is very important here.
If we choose f to be a polynomial, the higher the degree, the better the performance on the training data.

But that is not the real performance: a polynomial of high enough degree would give zero error!!

Learning from examples: Generalization

To obtain a classifier (or a regression function) we use the training set.

We know the class labels of the patterns (or the values of the prediction variable) in the training set.

Errors on the training set do not necessarily tell us how good the classifier is.

Any classifier that amounts to only storing the training set is useless.

We are interested in the generalization abilities: how does our classifier perform on unseen or new patterns?

Design of Classifiers

The classifier should perform well in spite of the inherent variability of patterns and the noise in feature extraction and/or in the class labels given in the training set.

Statistical Pattern Recognition: an approach where the variabilities are captured through probabilistic models.

There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods, etc.

In this course we consider classification and regression (function learning) problems in the statistical framework.

Statistical Pattern Recognition

X is the feature space (we take X = R^n). We consider a 2-class problem for simplicity of notation.

Let fi be the probability density function of the feature vectors from class-i, i = 0, 1.

The fi are called class conditional densities.

Let X = (X1, . . . , Xn) ∈ R^n represent the feature vector.
Then fi is the (conditional) joint density of the random variables X1, . . . , Xn given that X is from class-i.

Class conditional densities model the variability in the feature values.
For example, the two classes can be uniformly distributed in two regions as shown. (The two classes are separable here.)

(Figure: two non-overlapping regions labelled Class 1 and Class 2.)

When class regions are separable, an important special case is linear separability.

(Figure: two panels, each showing regions labelled Class 1 and Class 2.)

The classes in the left panel above are linearly separable (they can be separated by a line) while those in the right panel are not linearly separable (though separable).

In general, the two class conditional densities can overlap. (The same value of the feature vector can come from different classes, with different probabilities.)

(Figure: overlapping regions labelled Class 1 and Class 2.)

The statistical viewpoint gives us one way of looking for an optimal classifier.

We can say we want the classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).

Let

qi(X) = Prob[class = i | X], i = 0, 1.

qi is called the posterior probability (function) for class-i.

Consider the classifier

h(X) = 0 if q0(X) > q1(X)
     = 1 otherwise

q0(X) > q1(X) would imply that the feature vector X is more likely to have come from class-0 than from class-1.

Hence, intuitively, such a classifier should minimize the probability of error in classification.

Statistical Pattern Recognition

X (= R^n) is the feature space. Y = {0, 1} is the set of class labels.

H = {h | h : X → Y} is the set of classifiers of interest.

For a feature vector X, let y(X) denote the class label of X. In general, y(X) is random.

Now we want to assign a figure of merit to each possible classifier in H.

For example, we can rate different classifiers by

F(h) = Prob[h(X) ≠ y(X)]

F(h) is the probability that h misclassifies a random X.

The optimal classifier would be the one with the lowest value of F.

Given an h, we can calculate F(h) only if we know the probability distributions of the classes.

Minimizing F is not a straightforward optimization problem.

Statistical PR contd.

Recall that fi(X) denotes the probability density function of the feature vectors of class-i (the class conditional densities).

Let pi = Prob[y(X) = i]. These are called prior probabilities.

Recall the posterior probabilities, qi(X) = Prob[y(X) = i | X]. Now, by Bayes theorem,

qi(X) = fi(X) pi / Z

where Z = f0(X) p0 + f1(X) p1 is the normalising constant.

Bayes Classifier

Consider the classifier given by

h(X) = 0 if q0(X)/q1(X) > 1
     = 1 otherwise

This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.

q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).

We will prove the optimality of the Bayes classifier later.
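A minimal sketch of a Bayes classifier, assuming (purely for illustration) that the class conditional densities f0, f1 are known one-dimensional Gaussians; the priors and Gaussian parameters below are made-up values. The rule decides class 0 exactly when p0 f0(x) > p1 f1(x):

```python
# Sketch of a Bayes classifier for one feature, assuming the class conditional
# densities are known 1-D Gaussians; all parameter values are illustrative.
import math

p0, p1 = 0.6, 0.4            # prior probabilities
mu0, sigma0 = 0.0, 1.0       # class-0 density parameters
mu1, sigma1 = 2.0, 1.0       # class-1 density parameters

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x):
    # decide class 0 iff p0*f0(x) > p1*f1(x), i.e. iff q0(x) > q1(x)
    return 0 if p0 * gaussian_pdf(x, mu0, sigma0) > p1 * gaussian_pdf(x, mu1, sigma1) else 1

print([bayes_classify(x) for x in (-1.0, 0.5, 1.5, 3.0)])
```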


Bayes Classifier contd.

How does one implement the Bayes Classifier?

We need the posterior probabilities, qi(X). These are not generally available.

It is enough if we can get fi(X) and pi, i = 0, 1. These can be estimated from the training set of examples.

There are different techniques for estimating the class conditional densities. (Fitting generative models to data.)

We can implement a Bayes Classifier with the estimated quantities.

Bayes Classifier (Contd.)

The Bayes classifier minimizes the probability of error (misclassification).

There are two kinds of errors in classification:
Classifying C-0 as C-1, or C-1 as C-0.

False Positive or False Negative; Type-I or Type-II; False Alarm or Missed Detection.

The costs of these errors may be different.
We may want to trade one type of error against the other.

Statistical PR contd.

A more general way to assign a figure of merit is to use a loss function, L : Y × Y → R+.

The idea is that L(h(X), y(X)) denotes the loss suffered by h on a pattern X.

Now we can define

F(h) = E[L(h(X), y(X))]

This F(h) is called the risk of h.

Loss functions

The 0-1 loss function:

L(a, b) = 0 if a = b
        = 1 otherwise.

Now

F(h) = E[L(h(X), y(X))] = Prob[h(X) ≠ y(X)].

(Same as before.)

Loss functions

A more general loss function is to have L(0, 1) ≠ L(1, 0) (with L(0, 0) = L(1, 1) = 0).

Now F(h) is the expected cost of misclassification.

The relative values of L(0, 1) and L(1, 0) determine how we trade off the two types of errors.

Bayes Classifier to minimize risk

The Bayes classifier under the more general loss function is given by

hB(X) = 0 if q0(X)/q1(X) > L(0, 1)/L(1, 0)
      = 1 otherwise

If L(0, 1) = L(1, 0), this is the same as the earlier one.

If L(0, 1) = 3 L(1, 0), then we want q0(X) > 3 q1(X) to say Class-0.

It can be shown to be optimal as before.
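A small sketch of this decision rule on assumed posterior values, for the case L(0, 1) = 3 L(1, 0); it shows how asymmetric losses shift the threshold away from "pick the more probable class":

```python
# Sketch of the risk-minimizing rule with unequal losses, on assumed posterior
# values.  L(0,1) is the loss of deciding class-0 when the true class is 1.
L01, L10 = 3.0, 1.0

def h_B(q0, q1):
    # decide class 0 iff q0/q1 > L(0,1)/L(1,0)
    return 0 if q0 * L10 > q1 * L01 else 1

# With q0 = 0.7, q1 = 0.3 we have q0/q1 < 3, so the rule says class 1 even
# though class 0 is more probable: the asymmetric losses shift the threshold.
print(h_B(0.7, 0.3), h_B(0.8, 0.2))
```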

The Bayes classifier needs knowledge of the class conditional densities and the prior probabilities. These need to be estimated using the training set.

This can be computationally expensive or may not always be feasible.

There are other methods for obtaining classifiers.

Nearest Neighbour (NN) Classifier (Rule)

A simple classifier that often performs very well.

We store some feature vectors from the training set as prototypes (this can be the whole training set). Note that we know the (correct) class label of each prototype.

Given a new pattern (feature vector) X, we find the prototype X' that is closest to X. Then we classify X into the same class as X'.

Nearest Neighbour Classifier contd.

A variation: the k-NN rule.
Find the k prototypes closest to X. Classify X into the majority class of these prototypes.

There are two main issues in designing an NN classifier:
Selection of prototypes
Distance between feature vectors

A very simple classifier to design and operate.

Time and memory needs depend on the number of prototypes and the complexity of the distance function.

Nearest Neighbour Classifier contd.

Selection of prototypes: How many? How to select them?

Distance function:
We can use the Euclidean distance:

d(X, X') = sqrt( Σ_i (xi − x'i)^2 )

A better method may be

d(X, X') = sqrt( Σ_i ((xi − x'i)/σi)^2 )

where σi^2 is the (estimated) variance of the i-th feature.

A more general form is

d(X, X') = (X − X')^T Σ^{-1} (X − X')

where Σ is the (estimated) covariance matrix. This is called the Mahalanobis distance.
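A minimal sketch of the k-NN rule with the distance functions above. The prototype set, its labels, and k = 3 are illustrative assumptions:

```python
# Sketch of a k-NN rule using either the (squared) Euclidean or the Mahalanobis
# distance; the prototype set below is made up for illustration.
import numpy as np
from collections import Counter

prototypes = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
labels = np.array([0, 0, 1, 1])

cov_inv = np.linalg.inv(np.cov(prototypes, rowvar=False))   # estimated covariance

def euclidean(x, p):
    return float(np.sum((x - p) ** 2))       # squared Euclidean distance

def mahalanobis(x, p):
    d = x - p
    return float(d @ cov_inv @ d)            # squared Mahalanobis distance

def knn_classify(x, k=3, dist=euclidean):
    dists = [dist(x, p) for p in prototypes]
    nearest = np.argsort(dists)[:k]          # indices of the k closest prototypes
    return Counter(labels[nearest]).most_common(1)[0][0]

print(knn_classify(np.array([1.1, 1.0]), k=3))
print(knn_classify(np.array([2.8, 3.0]), k=3, dist=mahalanobis))
```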

Nearest Neighbour Classifier

The NN rule does not really use any statistical viewpoint.

If we have a sequence of iid examples, then asymptotically the probability of error of the NN rule is never more than twice the Bayes error.

The NN rule is also related to certain non-parametric methods of estimating class conditional densities.

Another approach: Discriminant functions

Consider the following structure for a classifier:

h(X) = 1 if g(X) > 0
     = 0 otherwise

g is called a discriminant function.

If we choose g(X) = q1(X) − q0(X), then this is the Bayes classifier.

Instead of assuming a functional form for the class conditional densities, we can assume a functional form for g and learn the needed function.

Linear Discriminant functions

Suppose g is specified by a parameter vector W ∈ R^N. We write g(W, X) for g(X) now.

For example, N = n + 1 and

g(W, X) = Σ_{i=1}^{n} wi xi + w0

W = (w0, . . . , wn)^T ∈ R^N is the parameter vector, and X = (x1, . . . , xn)^T ∈ R^n is the feature vector.

This is known as a linear discriminant function.

A linear discriminant function based classifier is

h(X) = 1 if Σ_i wi xi + w0 > 0
     = 0 otherwise

Let us take X = (1, x1, x2, . . . , xn)^T. This is called the augmented feature vector.

Recall W = (w0, w1, . . . , wn)^T.

Now the classifier is h(X) = sgn(W^T X).

This is one of the earliest classifiers considered (called the Perceptron).

Linear discriminant functions contd.

The training set {(Xi, yi), i = 1, . . . , ℓ} of patterns is said to be linearly separable if there exists a W such that

Xi^T W > 0 if yi = 1
       < 0 if yi = 0

Any W that satisfies the above is called a separating hyperplane. (There exist infinitely many separating hyperplanes.)


Learning linear discriminant functions

We need to learn an optimal W from the training samples.

The Perceptron learning algorithm is one of the earliest algorithms for learning linear discriminant functions.

It finds a separating hyperplane if the training set is linearly separable.

We can also take a risk-minimization approach to learning discriminant functions.
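A minimal sketch of the classical Perceptron learning algorithm on augmented feature vectors; the tiny linearly separable training set is made up for illustration. The update adds a misclassified class-1 vector to W and subtracts a misclassified class-0 vector, stopping once an epoch produces no mistakes (W is then a separating hyperplane):

```python
# Sketch of the perceptron learning algorithm on augmented feature vectors,
# with a small made-up linearly separable training set.
import numpy as np

X = np.array([[1.0, 2.0, 2.0],    # augmented vectors: first component is 1
              [1.0, 1.5, 2.5],
              [1.0, -1.0, -1.5],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, 0, 0])

W = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for Xi, yi in zip(X, y):
        pred = 1 if W @ Xi > 0 else 0
        if pred != yi:
            # move towards misclassified class-1 points, away from class-0 points
            W += Xi if yi == 1 else -Xi
            mistakes += 1
    if mistakes == 0:             # converged: W is a separating hyperplane
        break

print("W =", W, " epochs used:", epoch + 1)
```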

Learning discriminant functions

The question is: how do we evaluate different W?

We can use the old function, F:

F(W) = E[L(h(W, X), y(X))]

We still have the problem of how to minimize F.

Given a W, we cannot calculate F(W) unless we know the underlying probability distributions.

But we can approximate the expectation by a sample average.

Consider the function F̂ defined by

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} L(h(W, Xi), yi)

where {(Xi, yi), i = 1, . . . , ℓ} is the training set of examples.

Then F̂(W) is a good approximation to F(W). (Law of large numbers; assume the examples are iid.)

So we can minimize F̂ instead.

Learning discriminant functions

F̂ measures the error of the classifier h(W, ·) on the training samples.

F measures the error on the full population.

But F cannot be calculated because we do not know the underlying probability distributions. (F̂ can be calculated.)

If we have a sufficient number of representative training samples, then the minimizer of F̂ would be good enough.

Empirical Risk Minimization

The empirical risk is the training error:

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} L(h(W, Xi), yi)

We search over a family of classifiers, H, to minimize the empirical risk.

The true risk is the test error:

F(W) = E[L(h(W, X), y(X))]

We actually want the minimizer of the true risk.

Let Wm be the minimizer of the empirical risk given m iid examples.

Then we can get bounds such as

F(Wm) ≤ F̂(Wm) + K sqrt( complexity(H) / m )

Hence empirical risk minimization is good if we get low training error and the number of examples is large relative to the complexity of the classifier.

Learning discriminant functions contd.

We need to minimize

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} L(h(W, Xi), yi)

How do we find the W that minimizes F̂(W)?

In general we need some optimization techniques.

As defined, h (and hence F̂) are discontinuous. Also, if we use the 0-1 loss function, L is discontinuous too.

Learning discriminant functions contd.

We can redefine h so that it is smooth.
For example, we can take h(W, X) = W^T X while learning, and finally use h(W, X) = sign(W^T X) as the classifier.

Or we can take h(W, X) = 1 / (1 + exp(−W^T X)).
This is called a sigmoid function. Now h is nonlinear (in W^T X) and differentiable.


Other Loss functions

To make the optimization easier we can use other loss functions.

Squared error loss:
L(h(X), y) = (h(X) − y)^2

Hinge loss:
L(h(X), y) = max(0, 1 − y h(X))

There are many other loss functions that one can use.
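The losses above written out as plain functions (a small sketch; for the hinge loss the labels are assumed to be in {−1, +1} and h(X) is the real-valued discriminant output):

```python
# Small sketch of the loss functions mentioned on this slide; for the hinge
# loss, labels are assumed to be in {-1, +1}.
def zero_one_loss(h_x, y):
    return 0.0 if h_x == y else 1.0

def squared_error_loss(h_x, y):
    return (h_x - y) ** 2

def hinge_loss(h_x, y):
    # h_x is the real-valued discriminant output, y in {-1, +1}
    return max(0.0, 1.0 - y * h_x)

print(squared_error_loss(0.8, 1.0), hinge_loss(0.4, +1), hinge_loss(-0.4, +1))
```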

Learning linear discriminant functions

Suppose we use the squared error loss:

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} (h(W, Xi) − yi)^2

Now we can use standard optimization techniques to minimize F̂.

If h(W, X) = W^T X, then this is standard linear least squares estimation.

If we use the sigmoid function, then it is called logistic regression.
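A rough sketch of the last point: fitting the sigmoid model h(W, X) = 1/(1 + exp(−W^T X)) by gradient descent on the squared-error empirical risk, on a made-up 2-D training set with augmented feature vectors. (Standard logistic regression is usually fit by minimizing a cross-entropy loss; the squared-error form is used here only to match the formulation on this slide, and the data, step size and iteration count are assumptions.)

```python
# Sketch: gradient descent on the squared-error empirical risk for the sigmoid
# model, on synthetic 2-D data (all numerical choices here are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 100
X_raw = rng.standard_normal((n, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(float)      # labels in {0, 1}
X = np.hstack([np.ones((n, 1)), X_raw])                # augmented feature vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.zeros(3)
eta = 0.5                                              # step size (assumed)
for _ in range(2000):
    h = sigmoid(X @ W)
    grad = (2.0 / n) * X.T @ ((h - y) * h * (1 - h))   # gradient of the empirical risk
    W -= eta * grad

pred = (sigmoid(X @ W) > 0.5).astype(float)
print("training accuracy:", np.mean(pred == y))
```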

There are efficient algorithms for learning linear discriminant functions.

In learning such linear models we can use W^T φ(X) instead of W^T X, where φ(X) = [φ1(X), . . . , φm(X)]^T, as long as the φi are fixed functions. It is like using zi = φi(X) as features.

We will be discussing some techniques for learning linear discriminant functions.

Beyond Linear Models

Learning linear models (classifiers) is generally efficient.

However, linear models are not always sufficient.

The best linear function may still be a poor fit.

How do we tackle more general situations?

Here are some possible viewpoints.

Neural network idea

Find a good parameterized class of nonlinear discriminant functions.

Multilayer feedforward neural nets are one such class.

Nonlinear functions are built up through composition of summations and sigmoids.

Useful for both classification and regression.
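A small sketch of the "composition of summations and sigmoids" idea: a one-hidden-layer feedforward network with three hidden units. The weights here are random placeholders; in practice they would be learned from data:

```python
# Sketch of a multilayer feedforward network: one hidden layer (3 units) and a
# single sigmoid output.  The weights are random placeholders, not learned.
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.standard_normal((3, 2))   # hidden-layer weights (3 units, 2 inputs)
b1 = rng.standard_normal(3)        # hidden-layer biases
W2 = rng.standard_normal(3)        # output weights
b2 = rng.standard_normal()         # output bias

def mlp(x):
    hidden = sigmoid(W1 @ x + b1)        # summation followed by a sigmoid
    return sigmoid(W2 @ hidden + b2)     # the same composition at the output

x = np.array([0.5, -1.0])
print("network output:", mlp(x))         # a nonlinear discriminant value in (0, 1)
```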

Decision Tree idea

Divide the feature space so that a linear classifier is enough in each region (e.g., Decision Trees).

(Figure: a feature space partitioned by hyperplanes H1, H2, H3, together with the corresponding tree whose leaves are labelled C0 and C1.)

Such tree-based models are possible for regression also.

SVM idea

Map X nonlinearly into a high-dimensional space and try a linear discriminant function there (e.g., Support Vector Machines).

Let X = [x1, x2] and let φ : R^2 → R^5 be given by

Z = φ(X) = [x1, x2, x1^2, x2^2, x1 x2]^T

Now,

g(X) = a0 + a1 x1 + a2 x2 + a3 x1^2 + a4 x2^2 + a5 x1 x2

is a quadratic discriminant function in R^2, but

g(Z) = a0 + a1 z1 + a2 z2 + a3 z3 + a4 z4 + a5 z5

is a linear discriminant function in the φ(X) space.
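A sketch of this feature map in code: a class defined by the interior of a circle is not linearly separable in R^2, but after the quadratic map φ it is handled by a linear (affine) discriminant. The synthetic data and the particular coefficients are assumptions for illustration:

```python
# Sketch of the quadratic feature map phi: a circular class boundary becomes a
# linear discriminant in the mapped space (data and coefficients are made up).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)      # class 1 = inside the unit circle

# In the mapped space, g(Z) = a0 + a^T Z with a0 = 1, a = (0, 0, -1, -1, 0)
# reproduces the circle rule as a linear (affine) discriminant.
a0, a = 1.0, np.array([0.0, 0.0, -1.0, -1.0, 0.0])
pred = np.array([1 if a0 + a @ phi(x) > 0 else 0 for x in X])
print("agreement with the circle rule:", np.mean(pred == y))
```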


Organization of this course

Bayes classifier for minimizing risk (and some variations)

Estimation techniques for class conditional densities:
Parametric and non-parametric models
ML and Bayesian estimation
Mixture models and EM algorithm

Learning linear classifiers and regression models:
Perceptron and LMS algorithm
Linear least squares estimation
Logistic regression

Probabilistic graphical models (?)

A simple introduction to statistical learning theory:
Complexity of a learning problem: VC dimension
Consistency of empirical risk minimization

Learning nonlinear models: Neural networks
Feedforward networks and backpropagation
Radial basis function networks
Recurrent neural networks
The issues in deep neural networks
CNN, RBM, Autoencoder models

Learning nonlinear models: SVMs
Optimal separating hyperplane
Support Vector Machine for learning optimal hyperplanes
SVM (kernel based) methods for classification and regression

Feature extraction / dimensionality reduction

Boosting and other variance reduction techniques

Multiple classifier combinations

