
Pattern Recognition and Neural Networks

P.S. Sastry
sastry@ee.iisc.ernet.in

Dept. of Electrical Engineering
Indian Institute of Science, Bangalore

Reference Books

R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley, 2002.

C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.

Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.

Pattern Recognition

A basic attribute of people: categorisation of sensory input.

Pattern → PR System → Class label

Examples of Pattern Recognition tasks:

Reading facial expressions
Recognising speech
Reading a document
Identifying a person by fingerprints
Diagnosis from medical images
Wine tasting

Machine Recognition of Patterns

Pattern → feature extractor → X → classifier → class label

The feature extractor makes some measurements on the input pattern.

X is called the Feature Vector. Often, X ∈ R^n.

The classifier maps each feature vector to a class label.

The features to be used are problem-specific.

Some Examples of PR Tasks

Character Recognition
Pattern: an image.
Class: the identity of the character.
Features: binary image, projections (e.g., row and column sums), moments, etc.

Examples Contd.

Speech Recognition
Pattern: a 1-D signal (or its sampled version).
Class: the identity of speech units.
Features: LPC models of chunks of speech, spectral information, cepstrum, etc.
The pattern can become a sequence of feature vectors.

Examples contd...

Document Classification (e.g., spam detection)
Pattern: a document.
Class: the type of document.
Features: word occurrence counts, word context, etc.

Many other applications:
Biometrics-based authentication,
Video Surveillance,
Credit Screening,
Imposter Detection, diagnostics of machinery, etc.

Design of Pattern Recognition Systems

Features depend on the problem. Measure relevant quantities.

Should we design features or learn them?

Some techniques are available to extract more relevant quantities from the initial measurements (e.g., PCA).

After feature extraction, each pattern is a vector.

The classifier is a function that maps such vectors into class labels.

Many general techniques of classifier design are available.

This course is about techniques for obtaining classifiers.

Some notation

Feature Space, X: the set of all possible feature vectors.

Classifier: a decision rule or a function, h : X → {1, . . . , M}.

Often, X = R^n. It is convenient to take M = 2; then we take the labels as {0, 1} or {−1, +1}.

Then any binary-valued function on X is a classifier.

Which h should we choose? We want a correct or optimal classifier.
How do we design classifiers?
How do we judge performance?
How do we provide performance guarantees?

We first consider the 2-class problem.

We can handle the M > 2 case if we know how to handle the 2-class problem.

Simplest alternative: design M 2-class classifiers (One vs Rest).

There are other possibilities: e.g., tree-structured classifiers.

The 2-class problem is the basic problem.

We will also look at M-class classifiers.

A simple PR problem

Problem: Spot the Right Candidate

Features:
x1: marks based on academic record
x2: marks in the interview

A Classifier: ax1 + bx2 > c ⇒ Good

Another Classifier: x1 x2 > c ⇒ Good
(or (x1 + a)(x2 + b) > c).

Design of the classifier:
We have to choose a specific form for the classifier.
What values should we use for parameters such as a, b, c?
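A toy sketch of the two classifier forms above; the parameter values a, b, c and the marks are made up, only to emphasise that both the form and the parameter values have to be chosen:

```python
# A toy sketch of the two classifier forms above; the parameter values a, b, c
# are illustrative assumptions, not values suggested by the slides.
def linear_classifier(x1, x2, a=0.6, b=0.4, c=50.0):
    return "Good" if a * x1 + b * x2 > c else "Not good"

def product_classifier(x1, x2, c=2500.0):
    return "Good" if x1 * x2 > c else "Not good"

print(linear_classifier(80, 40), product_classifier(80, 40))
```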

Designing Classifiers

We need to decide how the feature vector values determine the class.
(How do the different marks reflect the goodness of a candidate?)

In most applications it is not possible to design the classifier from the physics of the problem.

The difficulties are:
A lot of variability in the patterns of a single class
Variability in feature vector values
Feature vectors of patterns from different classes can be arbitrarily close
Noise in the measurements

Designing Classifiers contd...

Often the only information available for the design is a training set of example patterns.

Training set: {(Xi, yi), i = 1, . . . , ℓ}.
Here Xi is an example feature vector of class yi (and ℓ is the number of examples).

Generation of the training set: take representative patterns of known category.

Now learn an appropriate function h as the classifier.

Test and validate the classifier on more data.

Training Set
(figure)

Another example problem

Problem: recognize persons of medium build.
Features: height and weight.
The classifier is nonlinear here.

Function Learning

A closely related problem. The output is continuous-valued rather than discrete as in classification.
Here the training set of examples could be {(Xi, yi), i = 1, . . . , ℓ}, Xi ∈ X, yi ∈ R.

The prediction variable, y, is continuous rather than taking finitely many values. (There can be noise in the examples.)

Similar learning techniques are needed to infer the underlying functional relationship between X and y.
(The regression function of y on X.)

Examples of Function Learning

Time series prediction: given a series x1, x2, . . ., find a function to predict xn.

Based on past values: find the best function

xn = h(x_{n-1}, x_{n-2}, . . . , x_{n-p})

Predict stock prices, exchange rates, etc.
The linear prediction model is used in speech analysis.

More general predictors can use other variables also.
Predict rainfall based on measurements and (possibly) previous years' data.
In general, System Identification. (An application: smart sensors.)

Examples contd... : Equaliser

Tx → x(k) → channel → Z(k) → filter → y(k) → Rx

We want y(k) = x(k). Design (or adapt) the filter to achieve this.

We can choose a filter of the form

y(k) = Σ_{i=1}^{T} ai Z(k − i)

Finding the best ai is a function learning problem.

Training set: {(x(k), Z(k)), k = 1, 2, . . . , N}
How do we know x(k) at the receiver end?
Prior agreements (protocols).
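A minimal sketch of how the best taps ai could be found by least squares. The synthetic channel, the noise level, and the choice T = 3 are illustrative assumptions, not part of the slides:

```python
# Sketch: fitting equalizer taps a_i by least squares, assuming a synthetic
# channel and T = 3 taps (both are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 3
x = rng.choice([-1.0, 1.0], size=N)                                 # transmitted symbols
Z = 0.8 * x + 0.4 * np.roll(x, 1) + 0.05 * rng.standard_normal(N)   # channel output

# Design matrix: row k holds [Z(k-1), ..., Z(k-T)]
A = np.array([[Z[k - i] for i in range(1, T + 1)] for k in range(T, N)])
b = x[T:N]                                   # desired outputs, y(k) = x(k)

a, *_ = np.linalg.lstsq(A, b, rcond=None)    # best taps in the least-squares sense
y = A @ a
print("taps:", a, " mean squared error:", np.mean((y - b) ** 2))
```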

Learning from Examples

Both classification and regression involve learning from examples, or learning from data.

Given a training set D = {(Xi, yi), i = 1, 2, . . . , ℓ}, we want to infer a model or function f such that on a new data item, X, we predict y = f(X).

This is the general problem addressed in Machine Learning.

Machine Learning

Machine Learning: a set of methods that can automatically detect regularities in data, in a form that is suitable for prediction or other decision-making scenarios.

We can think of machine learning as a principled approach to fitting models for many different kinds of data.

Hence it is closely related to statistics.

Machine learning encompasses many different data analysis techniques.

Predictive or Supervised Learning:
Classification or regression: given training data, learn to predict an attribute (y) based on the others (X).
Supervised in the sense that we know the correct answer (modulo noise) for the training data.
Variations: ordinal regression, semi-supervised learning, etc.

In this course we will mainly discuss supervised learning techniques.

Reinforcement Learning: learning by doing. For example, how we learn to ride a bicycle.

The feedback we get during training is only evaluative (and noisy).

Useful in many decision-making and control applications.

Unsupervised Learning: analysis of unlabelled data.

Clustering
Frequent Patterns (market basket analysis)
Dimensionality reduction, latent factor analysis
Topic Discovery
Collaborative filtering (imputing missing values)

Learning from examples

Given a training set {(X1, y1), . . . , (Xℓ, yℓ)}, we essentially want to fit a best function y = f(X).

For X ∈ R, this is the familiar curve-fitting problem.

Model choice (the choice of form for f) is very important here.
If we choose f to be a polynomial, the higher the degree, the better the performance on the training data.

But that is not the real performance: a polynomial of high enough degree would give zero error!!

Learning from examples: Generalization

To obtain a classifier (or a regression function) we use the training set.

We know the class labels of the patterns (or the values of the prediction variable) in the training set.

Errors on the training set do not necessarily tell us how good the classifier is.

Any classifier that amounts to only storing the training set is useless.

We are interested in the generalization abilities: how does our classifier perform on unseen or new patterns?

Design of Classifiers

The classifier should perform well in spite of the inherent variability of patterns and the noise in feature extraction and/or in the class labels given in the training set.

Statistical Pattern Recognition: an approach where the variabilities are captured through probabilistic models.

There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods, etc.

In this course we consider classification and regression (function learning) problems in the statistical framework.

Statistical Pattern Recognition

X is the feature space (we take X = R^n). We consider a 2-class problem for simplicity of notation.

Let fi be the probability density function of the feature vectors from class-i, i = 0, 1.

The fi are called class conditional densities.

Let X = (X1, . . . , Xn) ∈ R^n represent the feature vector.
Then fi is the (conditional) joint density of the random variables X1, . . . , Xn given that X is from class-i.

Class conditional densities model the variability in the feature values.
For example, the two classes can be uniformly distributed in two regions as shown. (The two classes are separable here.)

(Figure: two non-overlapping regions labelled Class 1 and Class 2.)

When class regions are separable, an important special case is linear separability.

(Figure: two panels, each showing regions labelled Class 1 and Class 2.)

The classes in the left panel above are linearly separable (they can be separated by a line) while those in the right panel are not linearly separable (though separable).

In general, the two class conditional densities can overlap. (The same value of the feature vector can come from different classes, with different probabilities.)

(Figure: overlapping regions labelled Class 1 and Class 2.)

The statistical viewpoint gives us one way of looking for an optimal classifier.

We can say we want the classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).

Let

qi(X) = Prob[class = i | X], i = 0, 1.

qi is called the posterior probability (function) for class-i.

Consider the classifier

h(X) = 0 if q0(X) > q1(X)
     = 1 otherwise

q0(X) > q1(X) would imply that the feature vector X is more likely to have come from class-0 than from class-1.

Hence, intuitively, such a classifier should minimize the probability of error in classification.

Statistical Pattern Recognition

X (= R^n) is the feature space. Y = {0, 1} is the set of class labels.

H = {h | h : X → Y} is the set of classifiers of interest.

For a feature vector X, let y(X) denote the class label of X. In general, y(X) is random.

Now we want to assign a figure of merit to each possible classifier in H.

For example, we can rate different classifiers by

F(h) = Prob[h(X) ≠ y(X)]

F(h) is the probability that h misclassifies a random X.

The optimal classifier would be the one with the lowest value of F.

Given an h, we can calculate F(h) only if we know the probability distributions of the classes.

Minimizing F is not a straightforward optimization problem.

Statistical PR contd.

Recall that fi(X) denotes the probability density function of the feature vectors of class-i (the class conditional densities).

Let pi = Prob[y(X) = i]. These are called prior probabilities.

Recall the posterior probabilities, qi(X) = Prob[y(X) = i | X]. Now, by Bayes theorem,

qi(X) = fi(X) pi / Z

where Z = f0(X) p0 + f1(X) p1 is the normalising constant.

Bayes Classifier

Consider the classifier given by

h(X) = 0 if q0(X)/q1(X) > 1
     = 1 otherwise

This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.

q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).

We will prove the optimality of the Bayes classifier later.
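A minimal sketch of a Bayes classifier, assuming (purely for illustration) that the class conditional densities f0, f1 are known one-dimensional Gaussians; the priors and Gaussian parameters below are made-up values. The rule decides class 0 exactly when p0 f0(x) > p1 f1(x):

```python
# Sketch of a Bayes classifier for one feature, assuming the class conditional
# densities are known 1-D Gaussians; all parameter values are illustrative.
import math

p0, p1 = 0.6, 0.4            # prior probabilities
mu0, sigma0 = 0.0, 1.0       # class-0 density parameters
mu1, sigma1 = 2.0, 1.0       # class-1 density parameters

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x):
    # decide class 0 iff p0*f0(x) > p1*f1(x), i.e. iff q0(x) > q1(x)
    return 0 if p0 * gaussian_pdf(x, mu0, sigma0) > p1 * gaussian_pdf(x, mu1, sigma1) else 1

print([bayes_classify(x) for x in (-1.0, 0.5, 1.5, 3.0)])
```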


Bayes Classifier contd.

How does one implement the Bayes Classifier?

We need the posterior probabilities, qi(X). These are not generally available.

It is enough if we can get fi(X) and pi, i = 0, 1. These can be estimated from the training set of examples.

There are different techniques for estimating the class conditional densities. (Fitting generative models to data.)

We can implement a Bayes Classifier with the estimated quantities.

Bayes Classifier (Contd.)

The Bayes classifier minimizes the probability of error (misclassification).

There are two kinds of errors in classification:
Classifying C-0 as C-1, or C-1 as C-0.

False Positive or False Negative; Type-I or Type-II; False Alarm or Missed Detection.

The costs of these errors may be different.
We may want to trade one type of error against the other.

Statistical PR contd.

A more general way to assign a figure of merit is to use a loss function, L : Y × Y → R+.

The idea is that L(h(X), y(X)) denotes the loss suffered by h on a pattern X.

Now we can define

F(h) = E[L(h(X), y(X))]

This F(h) is called the risk of h.

Loss functions

The 0-1 loss function:

L(a, b) = 0 if a = b
        = 1 otherwise.

Now

F(h) = E[L(h(X), y(X))] = Prob[h(X) ≠ y(X)].

(Same as before.)

Loss functions

A more general loss function is to have L(0, 1) ≠ L(1, 0) (with L(0, 0) = L(1, 1) = 0).

Now F(h) is the expected cost of misclassification.

The relative values of L(0, 1) and L(1, 0) determine how we trade off the two types of errors.

Bayes Classifier to minimize risk

The Bayes classifier under the more general loss function is given by

hB(X) = 0 if q0(X)/q1(X) > L(0, 1)/L(1, 0)
      = 1 otherwise

If L(0, 1) = L(1, 0), this is the same as the earlier one.

If L(0, 1) = 3 L(1, 0), then we want q0(X) > 3 q1(X) to say Class-0.

It can be shown to be optimal as before.
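A small sketch of this decision rule on assumed posterior values, for the case L(0, 1) = 3 L(1, 0); it shows how asymmetric losses shift the threshold away from "pick the more probable class":

```python
# Sketch of the risk-minimizing rule with unequal losses, on assumed posterior
# values.  L(0,1) is the loss of deciding class-0 when the true class is 1.
L01, L10 = 3.0, 1.0

def h_B(q0, q1):
    # decide class 0 iff q0/q1 > L(0,1)/L(1,0)
    return 0 if q0 * L10 > q1 * L01 else 1

# With q0 = 0.7, q1 = 0.3 we have q0/q1 < 3, so the rule says class 1 even
# though class 0 is more probable: the asymmetric losses shift the threshold.
print(h_B(0.7, 0.3), h_B(0.8, 0.2))
```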

The Bayes classifier needs knowledge of the class conditional densities and the prior probabilities. These need to be estimated using the training set.

This can be computationally expensive or may not always be feasible.

There are other methods for obtaining classifiers.

Nearest Neighbour (NN) Classifier (Rule)

A simple classifier that often performs very well.

We store some feature vectors from the training set as prototypes (this can be the whole training set). Note that we know the (correct) class label of each prototype.

Given a new pattern (feature vector) X, we find the prototype X' that is closest to X. Then we classify X into the same class as X'.

Nearest Neighbour Classifier contd.

A variation: the k-NN rule.
Find the k prototypes closest to X. Classify X into the majority class of these prototypes.

There are two main issues in designing an NN classifier:
Selection of prototypes
Distance between feature vectors

A very simple classifier to design and operate.

Time and memory needs depend on the number of prototypes and the complexity of the distance function.

Nearest Neighbour Classifier contd.

Selection of prototypes: How many? How to select them?

Distance function:
We can use the Euclidean distance:

d(X, X') = sqrt( Σ_i (xi − x'i)^2 )

A better method may be

d(X, X') = sqrt( Σ_i ((xi − x'i)/σi)^2 )

where σi^2 is the (estimated) variance of the i-th feature.

A more general form is

d(X, X') = (X − X')^T Σ^{-1} (X − X')

where Σ is the (estimated) covariance matrix. This is called the Mahalanobis distance.
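A minimal sketch of the k-NN rule with the distance functions above. The prototype set, its labels, and k = 3 are illustrative assumptions:

```python
# Sketch of a k-NN rule using either the (squared) Euclidean or the Mahalanobis
# distance; the prototype set below is made up for illustration.
import numpy as np
from collections import Counter

prototypes = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
labels = np.array([0, 0, 1, 1])

cov_inv = np.linalg.inv(np.cov(prototypes, rowvar=False))   # estimated covariance

def euclidean(x, p):
    return float(np.sum((x - p) ** 2))       # squared Euclidean distance

def mahalanobis(x, p):
    d = x - p
    return float(d @ cov_inv @ d)            # squared Mahalanobis distance

def knn_classify(x, k=3, dist=euclidean):
    dists = [dist(x, p) for p in prototypes]
    nearest = np.argsort(dists)[:k]          # indices of the k closest prototypes
    return Counter(labels[nearest]).most_common(1)[0][0]

print(knn_classify(np.array([1.1, 1.0]), k=3))
print(knn_classify(np.array([2.8, 3.0]), k=3, dist=mahalanobis))
```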

Nearest Neighbour Classifier

The NN rule does not really use any statistical viewpoint.

If we have a sequence of iid examples, then asymptotically the probability of error of the NN rule is never more than twice the Bayes error.

The NN rule is also related to certain non-parametric methods of estimating class conditional densities.

Another approach: Discriminant functions

Consider the following structure for a classifier:

h(X) = 1 if g(X) > 0
     = 0 otherwise

g is called a discriminant function.

If we choose g(X) = q1(X) − q0(X), then this is the Bayes classifier.

Instead of assuming a functional form for the class conditional densities, we can assume a functional form for g and learn the needed function.

Linear Discriminant functions

Suppose g is specified by a parameter vector W ∈ R^N. We write g(W, X) for g(X) now.

For example, N = n + 1 and

g(W, X) = Σ_{i=1}^{n} wi xi + w0

W = (w0, . . . , wn)^T ∈ R^N is the parameter vector, and X = (x1, . . . , xn)^T ∈ R^n is the feature vector.

This is known as a linear discriminant function.

A linear discriminant function based classifier is

h(X) = 1 if Σ_i wi xi + w0 > 0
     = 0 otherwise

Let us take X = (1, x1, x2, . . . , xn)^T. This is called the augmented feature vector.

Recall W = (w0, w1, . . . , wn)^T.

Now the classifier is h(X) = sgn(W^T X).

This is one of the earliest classifiers considered (called the Perceptron).

Linear discriminant functions contd.

The training set {(Xi, yi), i = 1, . . . , ℓ} of patterns is said to be linearly separable if there exists a W such that

Xi^T W > 0 if yi = 1
       < 0 if yi = 0

Any W that satisfies the above is called a separating hyperplane. (There exist infinitely many separating hyperplanes.)


Learning linear discriminant functions

We need to learn an optimal W from the training samples.

The Perceptron learning algorithm is one of the earliest algorithms for learning linear discriminant functions.

It finds a separating hyperplane if the training set is linearly separable.

We can also take a risk-minimization approach to learning discriminant functions.
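A minimal sketch of the classical Perceptron learning algorithm on augmented feature vectors; the tiny linearly separable training set is made up for illustration. The update adds a misclassified class-1 vector to W and subtracts a misclassified class-0 vector, stopping once an epoch produces no mistakes (W is then a separating hyperplane):

```python
# Sketch of the perceptron learning algorithm on augmented feature vectors,
# with a small made-up linearly separable training set.
import numpy as np

X = np.array([[1.0, 2.0, 2.0],    # augmented vectors: first component is 1
              [1.0, 1.5, 2.5],
              [1.0, -1.0, -1.5],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, 0, 0])

W = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for Xi, yi in zip(X, y):
        pred = 1 if W @ Xi > 0 else 0
        if pred != yi:
            # move towards misclassified class-1 points, away from class-0 points
            W += Xi if yi == 1 else -Xi
            mistakes += 1
    if mistakes == 0:             # converged: W is a separating hyperplane
        break

print("W =", W, " epochs used:", epoch + 1)
```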

Learning discriminant functions

The question is: how do we evaluate different W?

We can use the old function, F:

F(W) = E[L(h(W, X), y(X))]

We still have the problem of how to minimize F.

Given a W, we cannot calculate F(W) unless we know the underlying probability distributions.

But we can approximate the expectation by a sample average.

Consider the function F̂ defined by

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} L(h(W, Xi), yi)

where {(Xi, yi), i = 1, . . . , ℓ} is the training set of examples.

Then F̂(W) is a good approximation to F(W). (Law of large numbers; assume the examples are iid.)

So we can minimize F̂ instead.

Learning discriminant functions

F̂ measures the error of the classifier h(W, ·) on the training samples.

F measures the error on the full population.

But F cannot be calculated because we do not know the underlying probability distributions. (F̂ can be calculated.)

If we have a sufficient number of representative training samples, then the minimizer of F̂ would be good enough.

Empirical Risk Minimization

The empirical risk is the training error:

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} L(h(W, Xi), yi)

We search over a family of classifiers, H, to minimize the empirical risk.

The true risk is the test error:

F(W) = E[L(h(W, X), y(X))]

We actually want the minimizer of the true risk.

Let Wm be the minimizer of the empirical risk given m iid examples.

Then we can get bounds such as

F(Wm) ≤ F̂(Wm) + K sqrt( complexity(H) / m )

Hence empirical risk minimization is good if we get low training error and the number of examples is large relative to the complexity of the classifier.

Learning discriminant functions contd.

We need to minimize

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} L(h(W, Xi), yi)

How do we find the W that minimizes F̂(W)?

In general we need some optimization techniques.

As defined, h (and hence F̂) are discontinuous. Also, if we use the 0-1 loss function, L is discontinuous too.

Learning discriminant functions contd.

We can redefine h so that it is smooth.
For example, we can take h(W, X) = W^T X while learning, and finally use h(W, X) = sign(W^T X) as the classifier.

Or we can take h(W, X) = 1 / (1 + exp(−W^T X)).
This is called a sigmoid function. Now h is nonlinear (in W^T X) and differentiable.


Other Loss functions

To make the optimization easier we can use other loss functions.

Squared error loss:
L(h(X), y) = (h(X) − y)^2

Hinge loss:
L(h(X), y) = max(0, 1 − y h(X))

There are many other loss functions that one can use.
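The losses above written out as plain functions (a small sketch; for the hinge loss the labels are assumed to be in {−1, +1} and h(X) is the real-valued discriminant output):

```python
# Small sketch of the loss functions mentioned on this slide; for the hinge
# loss, labels are assumed to be in {-1, +1}.
def zero_one_loss(h_x, y):
    return 0.0 if h_x == y else 1.0

def squared_error_loss(h_x, y):
    return (h_x - y) ** 2

def hinge_loss(h_x, y):
    # h_x is the real-valued discriminant output, y in {-1, +1}
    return max(0.0, 1.0 - y * h_x)

print(squared_error_loss(0.8, 1.0), hinge_loss(0.4, +1), hinge_loss(-0.4, +1))
```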

Learning linear discriminant functions

Suppose we use the squared error loss:

F̂(W) = (1/ℓ) Σ_{i=1}^{ℓ} (h(W, Xi) − yi)^2

Now we can use standard optimization techniques to minimize F̂.

If h(W, X) = W^T X, then this is standard linear least squares estimation.

If we use the sigmoid function, then it is called logistic regression.
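A rough sketch of the last point: fitting the sigmoid model h(W, X) = 1/(1 + exp(−W^T X)) by gradient descent on the squared-error empirical risk, on a made-up 2-D training set with augmented feature vectors. (Standard logistic regression is usually fit by minimizing a cross-entropy loss; the squared-error form is used here only to match the formulation on this slide, and the data, step size and iteration count are assumptions.)

```python
# Sketch: gradient descent on the squared-error empirical risk for the sigmoid
# model, on synthetic 2-D data (all numerical choices here are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 100
X_raw = rng.standard_normal((n, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(float)      # labels in {0, 1}
X = np.hstack([np.ones((n, 1)), X_raw])                # augmented feature vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.zeros(3)
eta = 0.5                                              # step size (assumed)
for _ in range(2000):
    h = sigmoid(X @ W)
    grad = (2.0 / n) * X.T @ ((h - y) * h * (1 - h))   # gradient of the empirical risk
    W -= eta * grad

pred = (sigmoid(X @ W) > 0.5).astype(float)
print("training accuracy:", np.mean(pred == y))
```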

There are efficient algorithms for learning linear discriminant functions.

In learning such linear models we can use W^T φ(X) instead of W^T X, where φ(X) = [φ1(X), . . . , φm(X)]^T, as long as the φi are fixed functions. It is like using zi = φi(X) as features.

We will be discussing some techniques for learning linear discriminant functions.

Beyond Linear Models

Learning linear models (classifiers) is generally efficient.

However, linear models are not always sufficient.

The best linear function may still be a poor fit.

How do we tackle more general situations?

Here are some possible viewpoints.

Neural network idea

Find a good parameterized class of nonlinear discriminant functions.

Multilayer feedforward neural nets are one such class.

Nonlinear functions are built up through composition of summations and sigmoids.

Useful for both classification and regression.
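A small sketch of the "composition of summations and sigmoids" idea: a one-hidden-layer feedforward network with three hidden units. The weights here are random placeholders; in practice they would be learned from data:

```python
# Sketch of a multilayer feedforward network: one hidden layer (3 units) and a
# single sigmoid output.  The weights are random placeholders, not learned.
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.standard_normal((3, 2))   # hidden-layer weights (3 units, 2 inputs)
b1 = rng.standard_normal(3)        # hidden-layer biases
W2 = rng.standard_normal(3)        # output weights
b2 = rng.standard_normal()         # output bias

def mlp(x):
    hidden = sigmoid(W1 @ x + b1)        # summation followed by a sigmoid
    return sigmoid(W2 @ hidden + b2)     # the same composition at the output

x = np.array([0.5, -1.0])
print("network output:", mlp(x))         # a nonlinear discriminant value in (0, 1)
```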

Decision Tree idea

Divide the feature space so that a linear classifier is enough in each region (e.g., Decision Trees).

(Figure: a feature space partitioned by hyperplanes H1, H2, H3, together with the corresponding tree whose leaves are labelled C0 and C1.)

Such tree-based models are possible for regression also.

SVM idea

Map X nonlinearly into a high-dimensional space and try a linear discriminant function there (e.g., Support Vector Machines).

Let X = [x1, x2] and let φ : R^2 → R^5 be given by

Z = φ(X) = [x1, x2, x1^2, x2^2, x1 x2]^T

Now,

g(X) = a0 + a1 x1 + a2 x2 + a3 x1^2 + a4 x2^2 + a5 x1 x2

is a quadratic discriminant function in R^2, but

g(Z) = a0 + a1 z1 + a2 z2 + a3 z3 + a4 z4 + a5 z5

is a linear discriminant function in the φ(X) space.
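A sketch of this feature map in code: a class defined by the interior of a circle is not linearly separable in R^2, but after the quadratic map φ it is handled by a linear (affine) discriminant. The synthetic data and the particular coefficients are assumptions for illustration:

```python
# Sketch of the quadratic feature map phi: a circular class boundary becomes a
# linear discriminant in the mapped space (data and coefficients are made up).
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)      # class 1 = inside the unit circle

# In the mapped space, g(Z) = a0 + a^T Z with a0 = 1, a = (0, 0, -1, -1, 0)
# reproduces the circle rule as a linear (affine) discriminant.
a0, a = 1.0, np.array([0.0, 0.0, -1.0, -1.0, 0.0])
pred = np.array([1 if a0 + a @ phi(x) > 0 else 0 for x in X])
print("agreement with the circle rule:", np.mean(pred == y))
```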


Organization of this course

Bayes classifier for minimizing risk (and some variations)

Estimation techniques for class conditional densities:
Parametric and non-parametric models
ML and Bayesian estimation
Mixture models and EM algorithm

Learning linear classifiers and regression models:
Perceptron and LMS algorithm
Linear least squares estimation
Logistic regression

Probabilistic graphical models (?)

A simple introduction to statistical learning theory:
Complexity of a learning problem: VC dimension
Consistency of empirical risk minimization

Learning nonlinear models: Neural networks
Feedforward networks and backpropagation
Radial basis function networks
Recurrent neural networks
The issues in deep neural networks
CNN, RBM, Autoencoder models

Learning nonlinear models: SVMs
Optimal separating hyperplane
Support Vector Machine for learning optimal hyperplanes
SVM (kernel based) methods for classification and regression

Feature extraction / dimensionality reduction

Boosting and other variance reduction techniques

Multiple classifier combinations

