
# 2/21/2017

Statistical Approach to PR

Various PR approaches


## Simple probabilistic approach

A fundamental statistical approach to the problem
of pattern classification.

Ideal case: the probability structure underlying the categories is known perfectly.

## Specific example: Fish sorting

Sea bass and salmon appear randomly on a conveyor belt. Let ω denote the state
of nature, or class:
ω = ω1 for sea bass, ω = ω2 for salmon.
The state of nature is unpredictable, so it must be described probabilistically by
a priori probabilities. Prior probabilities reflect our knowledge of how likely each
type of fish is to appear before we actually see it.

## A priori probabilities may be known in advance, e.g.

1) based on collected samples used as training data,
2) based on the time of the year,
3) based on location.
Let
P(ω1) be the prior probability that the fish is a sea bass, and
P(ω2) be the prior probability that the fish is a salmon.
Also, assuming no other fish can appear,
P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)


Discrimination function
Case 1: We must make a decision without seeing the fish, i.e. with no features.

Prior probabilities of classes ω1 and ω2: estimate them from available training data,
i.e., if N is the total number of available training patterns, and N1, N2 of them
belong to ω1 and ω2, respectively, e.g.

P(ω1) = N1/N = 0.9, P(ω2) = N2/N = 0.1

Discriminating function:
if P(ω1) > P(ω2): choose ω1, otherwise choose ω2
P(error) = P(choose ω2 | ω1)·P(ω1) + P(choose ω1 | ω2)·P(ω2) = min[P(ω1), P(ω2)]
e.g. if we always choose ω1:
P(error) = 0 × 0.9 + 1 × 0.1 = 0.1
i.e. the probability of error is 10%, the minimum achievable without features.

So it works well if P(ω1) >> P(ω2),
but not well at all if P(ω1) = P(ω2), i.e. if the classes are equiprobable (uniform priors).
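The prior-only decision rule above can be sketched as follows; the training counts N1, N2 are hypothetical values matching the slide's example:

```python
# Sketch of the prior-only (no-feature) decision rule.
# Priors are estimated from assumed training counts N1, N2.
N, N1, N2 = 100, 90, 10          # assumed training counts
p1, p2 = N1 / N, N2 / N          # P(w1) = 0.9, P(w2) = 0.1

# Discriminating function: always choose the class with the larger prior.
choice = 1 if p1 > p2 else 2

# With no features, the best achievable error is the smaller prior.
p_error = min(p1, p2)
print(choice, p_error)           # -> 1 0.1
```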

## Bayesian Decision Theory

A classifier based on Bayesian decision theory
integrates all the available problem information,
such as measurements, prior probabilities, likelihoods
and evidence, to form the decision rules.

Decision rules are formed:
a) by calculating the posterior probability P(ωi|X) from
the prior probability P(ωi) and the likelihood p(X|ωi): Bayes' theorem;
b) by formulating a measure of expected classification
error, or risk, and choosing a decision rule that
minimizes this measure.


Bayes' Theorem
To derive Bayes' theorem, start from the definition of conditional probability. The
probability of the event A given the event B is

P(A|B) = P(A ∩ B) / P(B),  and likewise  P(B|A) = P(A ∩ B) / P(A)

Rearranging and combining these two equations, we find

P(A|B)·P(B) = P(A ∩ B) = P(B|A)·P(A)

Discarding the middle term and dividing both sides by P(B), provided that neither
P(B) nor P(A) is 0, we obtain Bayes' theorem:

P(A|B) = P(B|A)·P(A) / P(B) ----------(1)

Bayes' Theorem
Bayes' theorem is often completed by noting that, according to the Law of Total
Probability,

P(B) = P(B|A)·P(A) + P(B|Ac)·P(Ac) -------(2)

where Ac is the complementary event of A.

Substituting (2) into (1):

P(A|B) = P(B|A)·P(A) / [P(B|A)·P(A) + P(B|Ac)·P(Ac)]

More generally, the Law of Total Probability states that, given a partition {Ai} of
the event space,

P(B) = Σi P(B|Ai)·P(Ai) -------(3)
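A quick numeric check of Bayes' theorem combined with the Law of Total Probability; the probabilities P(A), P(B|A), P(B|Ac) here are arbitrary illustrative values:

```python
# Numeric check of Bayes' theorem via the Law of Total Probability.
p_a = 0.3                 # assumed P(A)
p_b_given_a = 0.8         # assumed P(B|A)
p_b_given_ac = 0.2        # assumed P(B|A^c)

# (2): P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_b = p_b_given_a * p_a + p_b_given_ac * (1 - p_a)

# (1): P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```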


## Pattern and Class Representation

A pattern is represented by a set of d features, or
attributes, viewed as a d-dimensional feature
vector x = (x1, x2, ..., xd)T.

Given a pattern x = (x1, x2, ..., xd)T, assign it to
one of n classes in the set C,

C = {ω1, ω2, ..., ωn}

## The States of Nature: Given

P(ωi): the class prior probabilities (calculated from training data)
P(x|ωi): the class-conditional probabilities, or likelihoods (calculated
from training data)

Fig: Classes ω1, ω2, ω3 with priors P(ω1), P(ω2), P(ω3) and class-conditional
densities p(x|ω1), p(x|ω2), p(x|ω3), from which the feature value x is observed.


Posterior Probability
P(ωi|X), the posterior probability that a test pattern X belongs to class ωi, is
given as:

P(ωi|X) = P(X|ωi)·P(ωi) / P(X)

i.e.  Posterior = (Likelihood × Prior) / Evidence

To classify a test pattern with attribute vector X, we assign it to the class most
probable for X.

This means we estimate P(ωi|X) for each class i = 1 to n, then assign the
pattern to the class with the highest posterior probability; equivalently, we
maximize P(ωi|X).

Bayes classifier:  P(ωi|X) = P(X|ωi)·P(ωi) / P(X)

So, maximizing the posterior means maximizing the right-hand side of the
above equation. All the terms can be calculated from training data.

P(X): it is constant for all the classes, so only a single value is needed.
In the case of a single feature x and multiple classes, it is calculated as:

P(X = x) = Σi=1..n p(x|ωi)·P(ωi)

P(X|ωi): considering all the attributes jointly needs huge computation. So,
assume that the attributes are independent of each other. This makes
the classifier a Naive Bayes classifier, i.e.

P(X|ωi) = Πk=1..d P(xk|ωi)
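The Naive Bayes factorization above can be sketched as follows; the priors and per-attribute likelihoods are made-up numbers for illustration, not estimates from any real data:

```python
import math

# Sketch of naive Bayes: P(X|wi) = prod_k P(x_k|wi), attributes assumed independent.
priors = {1: 0.6, 2: 0.4}                      # assumed P(w1), P(w2)
# likelihoods[c] = [P(x_1|w_c), P(x_2|w_c)] for the observed attribute values
likelihoods = {1: [0.5, 0.2], 2: [0.1, 0.7]}

def posterior_scores(priors, likelihoods):
    # Unnormalized posteriors: the evidence P(X) is the same for every class,
    # so it can be dropped from the comparison.
    return {c: priors[c] * math.prod(likelihoods[c]) for c in priors}

scores = posterior_scores(priors, likelihoods)
best = max(scores, key=scores.get)
print(best)
```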


## Bayes Classifier for fish sorting

The simple probabilistic approach is not sufficient to classify the
fish.

To improve classification correctness, we use features that can be
measured on the fish and incorporate the Bayes classification concept.

Assume the fish (sea bass/salmon) is described by a scalar X
comprising a single feature:
x, the length (a continuous random variable).

Define p(x|ωi) as the class-conditional probability density (the
probability density of x given the state of nature ωi). Its distribution
depends on the state of nature (class) ωi.

These probability density functions can be obtained by observing a
large number of training pattern samples (statistically).

## Bayes Classifier for fish sorting

p(x|ω1) and p(x|ω2), shown as pdf curves, describe the
difference between the populations of the two classes, e.g. sea bass
and salmon:
Hypothetical class-conditional pdfs show the probability
density of measuring a particular feature value x given that
the pattern is in category ωi.

The density functions are normalized,
so the area under each curve is 1.0.


## Bayes Classifier for fish sorting

Suppose we know the priors P(ωi) and densities p(x|ωi), and we measure the length
of a fish as the value x.
Now P(ωi|x), the a posteriori probability (probability of class ωi given the
measured feature value x), is given by Bayes' theorem as

P(ωi|x) = p(x|ωi)·P(ωi) / p(x),   where  p(x) = Σi=1..2 p(x|ωi)·P(ωi)

and P(ω1|x) + P(ω2|x) = 1.

Decision Rule:

c = ω1 if P(ω1|x) > P(ω2|x), i.e. if p(x|ω1)·P(ω1) > p(x|ω2)·P(ω2);
    ω2 otherwise.

Because p(x), the evidence, is constant for all classes, it does not affect the
maximization of P(ωi|x). If, in addition, the classes are equiprobable, i.e.
P(ω1) = P(ω2) = 0.5, then the prior can also be dropped from the comparison,
and the maximization depends only on the likelihood p(x|ωi).

Fig (a): Hypothetical class-conditional pdfs show the probability density of
measuring a particular feature value x given that the pattern is in category ωi.
Fig (b): Posterior probabilities for fig (a) under the particular priors
P(ω1) = 2/3 and P(ω2) = 1/3.
Given that a pattern is measured to have feature value x = 14, its
posterior probability for class ω2 is 0.08 and for class ω1 is 0.92.
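The posterior computation can be sketched for two Gaussian class-conditional densities; the means and standard deviations below are assumptions for illustration, not the values behind the lecture's figures:

```python
import math

# Posteriors for two classes with assumed Gaussian class-conditional pdfs.
def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posteriors(x, priors, params):
    like = [gauss(x, mu, s) for mu, s in params]
    evidence = sum(l * p for l, p in zip(like, priors))   # p(x)
    return [l * p / evidence for l, p in zip(like, priors)]

priors = [2 / 3, 1 / 3]
params = [(10.0, 2.0), (14.0, 2.0)]   # (mean, std) per class, assumed
post = posteriors(12.0, priors, params)   # x equidistant from both means
print(post[0] > post[1])
```

At x = 12.0 the two likelihoods are equal, so the posteriors reduce to the priors and the more probable class wins.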


Probability of Error
Remember that the goal is to minimize error.
Whenever we observe a particular x, the probability of error is:
P(error|x) = P(ω1|x) if we decide ω2
P(error|x) = P(ω2|x) if we decide ω1
Therefore, under the optimal rule,
P(error|x) = min[P(ω1|x), P(ω2|x)]

For any given x, we can minimize the error by choosing the
larger of P(ω1|x) and P(ω2|x), i.e.
decide class ω1 if
P(ω1|x) > P(ω2|x)
or, equivalently, if p(x|ω1)·P(ω1) > p(x|ω2)·P(ω2).

The decision boundary xB is
the border between
classes ωi and ωj, simply
where
P(ωi|x) = P(ωj|x)

Fig: Components of error with equal priors and a non-optimal decision point x*.
The complete pink area (including the triangular area) corresponds to the probability
of error for deciding ω1 when the class is ω2, and the gray area corresponds to the
converse.

If we select xB (as the decision boundary) in place of x*, then we eliminate the
reducible error portion and minimize the error.


Generalization:
Allowing more than two features:
replace the scalar x with a vector x from a d-dimensional feature
space Rd.

Allowing more than two classes:
decide ωi if P(ωi|X) > P(ωj|X) for all j ≠ i.

Allowing actions other than classification:
e.g. allow not classifying a pattern if the class is not known.

Introducing a loss function more general than the probability of error:
weighting decision costs.

Allowing generalization of the probability density function:
1. Gaussian (normal) density function
2. Uniform density function

## Modeling the conditional density

The Bayes classifier's behavior is determined by
1) the conditional density p(x|ωi)
2) the a priori probability P(ωi)

If we assume that our training data follows some standard probability
distribution function, then estimating p(x|ωi) for a particular value of x is easy,
i.e. there is no need to first find the actual pdf.

The normal (Gaussian) probability density function is a very common assumption
for the pdf, because it is:
frequently observed in nature, as extensive studies have shown;
well behaved and analytically tractable;
an appropriate model for continuous values randomly distributed around a mean μ;
suitable for multivariate modeling.
Above all, the Gaussian pdf yields optimal Bayesian classifiers.


## Univariate Gaussian (or Normal) function: N(μ, σ2)

A univariate, or single-dimensional (d = 1), Gaussian function is defined as:

p(x) = 1/(√(2π)·σ) · exp(−(x − μ)2 / (2σ2))

The pdf has roughly 95% of its area
in the range |x − μ| ≤ 2σ, and the
peak has value p(μ) = 1/(√(2π)·σ).

(a) Mean value μ = 0 and variance σ2 = 1
(b) Mean value μ = 1 and variance σ2 = 0.2

The larger the variance, the broader the graph is.
The graphs are symmetric and centered at the respective
mean values.
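A small sketch checking the two properties stated above for N(0, 1): the peak value 1/(√(2π)·σ) and the roughly 95% of the area within 2σ of the mean (the integral is approximated by a simple Riemann sum):

```python
import math

# Univariate Gaussian pdf N(mu, sigma^2).
def normal_pdf(x, mu, sigma):
    coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 0.0, 1.0
peak = normal_pdf(mu, mu, sigma)          # should equal 1/sqrt(2*pi) ~ 0.3989

# Approximate the area within |x - mu| <= 2*sigma by a Riemann sum.
step = 0.001
mass = sum(normal_pdf(mu + i * step, mu, sigma) * step for i in range(-2000, 2000))
print(round(peak, 4), round(mass, 2))
```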


## Multivariate Gaussian (or Normal) function: N(μ, Σ)

A multivariate, or d-dimensional, Gaussian function is defined as:

p(x) = 1/((2π)d/2·|Σ|1/2) · exp(−½·(x − μ)T Σ−1 (x − μ))

## Case 1: Single feature and two classes (equiprobable) problem with Gaussian pdfs

Number of features: d = 1 (x). Number of classes: C = 2.
Prior probabilities: P(ω1) = P(ω2) = 0.5 # same for both classes, for simplicity
Class variances: σ1 = σ2 = σ # same for both classes
Class means: μ1 ≠ μ2 # different for the two classes

Posterior probability:

P(ωi|x) = p(x|ωi)·P(ωi) / p(x)

p(x) is not taken into account, because it is the same for all classes and does not
affect the decision. Furthermore, since the a priori probabilities are equal, they
also do not affect the comparison of the two posterior values.

So, the decision rule is: max[p(x|ω1), p(x|ω2)] # also called the maximum likelihood rule

i.e. the search for the maximum now rests on the values of the conditional pdfs
evaluated at x.


## Case 1: Single feature and two classes problem with Gaussian pdfs

The line at x0 is a threshold partitioning
the feature space into two regions,
R1 and R2 (the decision boundary).

The total probability of committing
a decision error for the case of two
equiprobable classes is given by:

P(error) = ½·∫R1 p(x|ω2) dx + ½·∫R2 p(x|ω1) dx

## Case 1: Single feature and two classes (not equiprobable) problem with Gaussian pdfs

The decision boundary depends on the prior probability, because it is part of the
posterior probability.


## Two features and two classes problem with Gaussian pdfs

To improve classification correctness further, we
add one more feature to our fish sorting.

Assume the fish (sea bass/salmon) is denoted
by a feature vector X comprising two features:
x1, the length, and
x2, the lightness intensity.

Define p(x1, x2|ωi) as the class-conditional
probability density function for both classes.

These probability density functions can be
obtained by observing a large number of
training pattern samples (statistically).

By extending Bayes' theorem to the multi-feature
and multi-class problem, we can design the decision
rule for our fish sorting problem.

Fig: The probability density functions in 2D (two features) for classes ω1 and ω2.
Since we have included two features, the decision boundaries are clearer,
which minimizes the error.

## Estimation of the discriminating function for the multivariate Gaussian pdf

The general multivariate Gaussian pdf is defined as:

p(x) = 1/((2π)d/2·|Σ|1/2) · exp(−½·(x − μ)T Σ−1 (x − μ))

Having class-specific mean vectors μi and class-specific covariance matrices Σi, the
class-dependent Gaussian density functions p(x|ωi) can be given as:

p(x|ωi) = 1/((2π)d/2·|Σi|1/2) · exp(−½·(x − μi)T Σi−1 (x − μi)) -------(1)


## Estimation of the discriminating function for the multivariate Gaussian pdf

For classification, the largest discriminating function for the ith class can be
given, in terms of posterior probability maximization, as:

gi(x) = P(ωi|x) = p(x|ωi)·P(ωi) / p(x)

Since p(x) is constant for all classes, we can drop it, so

gi(x) = p(x|ωi)·P(ωi)

Because of the exponential form of the involved densities, the logarithmic function
log(·) can be used for an alternative discriminating function for the ith class:

gi(x) = log(p(x|ωi)·P(ωi)) = log p(x|ωi) + log P(ωi) --------(2)

## Discriminating function for the multivariate Gaussian pdf

From equations 1 and 2 we find:

gi(x) = −½·(x − μi)T Σi−1 (x − μi) − (d/2)·log 2π − ½·log|Σi| + log P(ωi) ----(3)

Dropping the class-independent constant (d/2)·log 2π:

gi(x) = −½·(x − μi)T Σi−1 (x − μi) − ½·log|Σi| + log P(ωi) ----(4)

Here Σi influences classification.
Now we have the following cases:

Case 1:
Σi = Σ = σ2·I (I stands for the identity matrix: the diagonal case), i.e. equal
covariance matrices for all classes, proportional to I.
Case 2:
Σi = Σ, an arbitrary covariance matrix, but identical for all classes.
Case 3:
Σi is an arbitrary covariance matrix, not identical across classes.


## Discriminating function for the multivariate Gaussian pdf

Case 1:
Σi = Σ = σ2·I (I stands for the identity matrix: the diagonal case), i.e. equal
covariance matrices for all classes, proportional to I.

gi(x) = −½·(x − μi)T Σ−1 (x − μi) − ½·log|Σ| + log P(ωi) ----(4)

Assuming equal covariance matrices (class dependence enters only through the mean
vectors μi), we can drop the 2nd term as a class-independent constant bias, so

gi(x) = −½·(x − μi)T Σ−1 (x − μi) + log P(ωi) -------(5)

But Σ influences classification through the first term, which is the squared
distance of the feature vector x from the ith mean vector μi, weighted by the
inverse of the covariance matrix Σ−1.

## Discriminating function for the multivariate Gaussian pdf

Assuming the features are statistically independent and each feature has the same
variance σ2, we have Σi = σ2·I and Σi−1 = (1/σ2)·I. Putting this into eq. 5, with
(x − μi)T·(x − μi) = ||x − μi||2, we can write eq. 6 as a simple discriminant function:

gi(x) = −||x − μi||2 / (2σ2) + log P(ωi) --------(6)

If the P(ωi) are the same for all c classes, then the log P(ωi) term becomes another
unimportant additive constant that can be ignored:

gi(x) = −||x − μi||2 / (2σ2)

If σ = 1 then Σ = I, which results in Euclidean space, so the Euclidean distance can
be taken as the discriminating factor; i.e. gi(x) is largest when the distance is
smallest.
The optimal decision rule:

To classify a feature vector x, measure the squared Euclidean distance ||x − μi||2
between x and each of the mean vectors, and assign x to the category of the nearest
mean. Such a classifier is called a minimum-distance classifier.
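The minimum-distance rule can be sketched as below; the two class means are assumed values, and equal priors with Σi = σ2·I are taken for granted as in the derivation above:

```python
# Minimum-distance classifier: assign x to the class of the nearest mean.
# Valid under equal priors and Sigma_i = sigma^2 I. Means are assumed.
means = {1: (1.0, 1.0), 2: (4.0, 5.0)}

def classify(x):
    def dist2(m):
        # squared Euclidean distance ||x - mu||^2
        return sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return min(means, key=lambda c: dist2(means[c]))

print(classify((1.5, 2.0)))   # nearest to the mean of class 1
```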


## Discriminant function as a linear discriminant function: Proof

The Bayesian discriminant function is given as:

gi(x) = −||x − μi||2 / (2σ2) + log P(ωi)

Since

||x − μi||2 = xT·x − 2·μiT·x + μiT·μi

then

gi(x) = −(xT·x − 2·μiT·x + μiT·μi) / (2σ2) + log P(ωi)

The quadratic term xT·x is the same for all i, making it an ignorable additive
constant. Thus we obtain the equivalent linear discriminant function

gi(x) = wiT·x + wi0

where  wi = μi / σ2  and  wi0 = −μiT·μi / (2σ2) + log P(ωi)
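A sketch verifying the proof numerically: the quadratic and linear forms of gi(x) differ only by the class-independent term −xT·x/(2σ2), so they rank the classes identically. The means, variance, and priors are assumed values:

```python
import math

sigma2 = 2.0                      # assumed common variance sigma^2
mus = [(0.0, 0.0), (3.0, 1.0)]    # assumed class means
priors = [0.5, 0.5]

def g_quadratic(x, mu, prior):
    # g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + log P(w_i)
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -d2 / (2 * sigma2) + math.log(prior)

def g_linear(x, mu, prior):
    # g_i(x) = w_i^T x + w_i0, with w_i = mu_i/sigma^2,
    # w_i0 = -mu_i^T mu_i/(2 sigma^2) + log P(w_i)
    w = [mi / sigma2 for mi in mu]
    w0 = -sum(mi * mi for mi in mu) / (2 * sigma2) + math.log(prior)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

x = (2.0, 2.0)
quad = [g_quadratic(x, m, p) for m, p in zip(mus, priors)]
lin = [g_linear(x, m, p) for m, p in zip(mus, priors)]
print(quad.index(max(quad)) == lin.index(max(lin)))   # True
```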


## Decision boundaries for the case Σi = Σ = σ2·I

The decision surfaces for a linear classifier are pieces of hyperplanes
defined by the linear equations gi(x) = gj(x) for the two categories with
the highest posterior probabilities.
If x0 is the threshold, then gi(x0) − gj(x0) = 0.

Since

gi(x) = −||x − μi||2 / (2σ2) + log P(ωi)

we get

x0 = ½·(μi + μj) − [σ2 / ||μi − μj||2] · log[P(ωi)/P(ωj)] · (μi − μj)

If P(ωi) = P(ωj), then the point x0 is halfway between the means, and the
hyperplane is the perpendicular bisector of the line between the
means.


If the covariances of two distributions are equal and proportional to the identity
matrix, then the distributions are hyperspherical in d dimensions, and the boundary
is a generalized hyperplane of d − 1 dimensions, perpendicular to the line
separating the means.

## Decision boundaries for Σi = σ2·I and P(ω1) ≠ P(ω2)

If P(ωi) ≠ P(ωj), then the point x0 shifts away from the more likely mean.

If the variance σ2 is small relative to the squared distance ||μi − μj||2,
then the position of the decision boundary x0 is relatively insensitive
to the exact values of the prior probabilities P(ωi).

1-D Case


2-D Case

3-D Case


## Discriminating function for the multivariate Gaussian pdf

Case 2: Σi = Σ, an arbitrary covariance matrix, but identical for all classes.

gi(x) = −½·(x − μi)T Σi−1 (x − μi) − ½·log|Σi| + log P(ωi) --------(4)

Due to the identical covariance matrix, we can drop the 2nd term and get:

gi(x) = −½·(x − μi)T Σ−1 (x − μi) + log P(ωi) --------(6)

If the prior probabilities P(ωi) are the same for all classes, then the
log P(ωi) term can be ignored.

In this case the covariance matrix affects the distance measurement, so this is not
the simple Euclidean distance; it is called the Mahalanobis distance from the
feature vector x to μi. It is given as:

r2 = (x − μi)T Σ−1 (x − μi)
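The Mahalanobis distance can be sketched for the 2-D case, inverting the 2×2 covariance matrix by hand; the covariance values are assumed for illustration:

```python
# Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) for a 2x2 covariance.
def mahalanobis2(x, mu, cov):
    (a, b), (c, d) = cov
    det = a * d - b * c                       # 2x2 determinant
    inv = ((d / det, -b / det), (-c / det, a / det))
    v = (x[0] - mu[0], x[1] - mu[1])          # x - mu
    u = (inv[0][0] * v[0] + inv[0][1] * v[1],
         inv[1][0] * v[0] + inv[1][1] * v[1])  # Sigma^{-1} (x - mu)
    return v[0] * u[0] + v[1] * u[1]

cov = ((2.0, 0.0), (0.0, 0.5))                # assumed covariance matrix
print(mahalanobis2((1.0, 1.0), (0.0, 0.0), cov))   # 1/2 + 1/0.5 = 2.5
```

With a diagonal covariance the distance weights each feature by the inverse of its variance, unlike the plain Euclidean distance.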


## Decision boundaries for the case Σi = Σ, arbitrary but identical for all classes

2-D Case


3-D Case

## Discriminating function for the multivariate Gaussian pdf

Case 3: Σi is an arbitrary covariance matrix, not identical for each class, but
invertible.
In the general multivariate normal case, the covariance matrices are different for
each category. The discriminating function is:

gi(x) = −½·(x − μi)T Σi−1 (x − μi) − ½·log|Σi| + log P(ωi)

Nothing can be dropped now from the above equation.
Also, due to the arbitrary, class-specific covariance matrix in the first term, the
resulting discriminant functions are inherently quadratic. By expanding the first
term we get:

gi(x) = xT·Wi·x + wiT·x + wi0

where  Wi = −½·Σi−1,  wi = Σi−1·μi,  and
wi0 = −½·μiT·Σi−1·μi − ½·log|Σi| + log P(ωi)


Decision boundaries
In the two-category case, the decision surfaces are hyperquadrics,
which can assume any of the following general forms:

1) Hyperplanes,
2) Pairs of hyperplanes,
3) Hyperspheres,
4) Hyperellipsoids,
5) Hyperparaboloids, and
6) Hyperhyperboloids

1-D Case

Fig: Non-simply connected decision regions can arise in one
dimension for Gaussians having unequal variance.


2-D Case

Fig: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are
general hyperquadrics.

3-D Case


## 2-features and 4-classes Case

Fig: The decision regions for four normal distributions. Even with such a low
number of categories, the shapes of the boundary regions can be rather complex.


## Introducing a loss function more general than the probability of error

In classification problems we face the following situations:

In simple cases, the errors due to any classification mistakes are
equally costly.

In some other cases, errors due to classification mistakes are not equally
costly, i.e. some mistakes are more costly than others. e.g.

In a RADAR signal analysis problem, the two classes correspond to a valid
signal (e.g. a message, or perhaps a radar return) and noise. Here
we attempt to develop strategies so that the incoming signal is
correctly classified.


Case 1: The incoming signal is a valid signal and we correctly
classify it as such (a hit): ω1 -> ω1

Case 2: The incoming signal is a valid signal and we incorrectly
classify it as noise (a miss): ω1 -> ω2

Case 3: The incoming signal is noise and we incorrectly classify
it as a valid signal (a false alarm): ω2 -> ω1

Case 4: The incoming signal is noise and we correctly classify it
as noise (a true reject): ω2 -> ω2

Suppose that after classification we need to take some action, such as firing a
missile or not.
In this case, we can assume that an action associated with a wrong
classification is more costly than one associated with a right classification.

Let α denote the action or decision taken as per the above four
cases, e.g.
α1 = fire a missile, if signal (ω1)
α2 = do not fire a missile, if noise (ω2)

So this example leads us to define the probability of error in a new
manner.


Let
X be a d-dimensional vector, {ω1, ω2, ..., ωn} a finite set of n classes, and

{α1, α2, ..., αn} a finite set of n possible actions associated with each
class, where αi is the decision to choose class ωi.
A decision rule: a function α(x), e.g. the mapping α(x) -> αi.
λ(αi|ωj) is a loss or cost function estimating the loss or cost resulting
from taking action αi when the class is ωj;
λij is the cost associated with selecting action αi when the class is ωj.
E.g. in the case c = 2 (number of classes):
λ11 and λ22 are the rewards for correct classification, and
λ12 and λ21 are the penalties for incorrect classification.
Risk R: in decision-theoretic terminology, the expected loss is called
the risk, and R denotes the overall risk.
R(αi|x) is the conditional risk, depending on the action selected.

## Estimation and Minimization of Risk

For an n-class problem, let X be classified as belonging to class ωk; then the
probability of error as per the Bayes rule is:

P(error|X) = Σi=1..n, i≠k P(ωi|X)

But this Bayes formulation of the error does not describe the risk. So, for the risk:

the probability that we select action αi when ωj is the true class is given as
P(αi, ωj) = P(αi|ωj)·P(ωj),

where the term P(αi|ωj) depends on the chosen mapping α(x) -> αi, which in turn
depends on x. The conditional risk is then given as:

R(αi|x) = Σj=1..n λ(αi|ωj)·P(ωj|x) = Σj=1..n λij·P(ωj|x)


## Estimation and Minimization of Risk

If c = 2 and the number of actions is 2, then the overall risk measure is:
R = λ11·P(α1|ω1)·P(ω1) + λ12·P(α1|ω2)·P(ω2)
  + λ21·P(α2|ω1)·P(ω1) + λ22·P(α2|ω2)·P(ω2)
Where,
for action α1, the measure of conditional risk is:
R[α(x) -> α1] = R(α1|x) = λ11·P(ω1|x) + λ12·P(ω2|x)

and for action α2, the measure of conditional risk is:
R[α(x) -> α2] = R(α2|x) = λ21·P(ω1|x) + λ22·P(ω2|x)
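The two conditional risks and the minimum-risk choice between them can be sketched as follows; the loss matrix values are assumed for illustration:

```python
# Two-class conditional risks R(a1|x), R(a2|x) with an assumed loss matrix:
# lam[(i, j)] is the cost of taking action a_i when the true class is w_j.
lam = {(1, 1): 0.0, (1, 2): 2.0, (2, 1): 4.0, (2, 2): 0.0}

def decide(post1, post2):
    # post1, post2 are the posteriors P(w1|x), P(w2|x)
    r1 = lam[(1, 1)] * post1 + lam[(1, 2)] * post2   # risk of action a1
    r2 = lam[(2, 1)] * post1 + lam[(2, 2)] * post2   # risk of action a2
    return 1 if r1 < r2 else 2

print(decide(0.4, 0.6))   # -> 1
```

Note that with these costs the rule picks action α1 even though P(ω2|x) = 0.6 is larger: the heavy penalty λ21 for missing class ω1 shifts the decision.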

For n classes, the expected risk R[α(x)] that x will be classified
incorrectly is (using the total probability theorem):

R = ∫ R(α(x)|x)·p(x) dx

## Estimation and Minimization of Risk

To minimize the overall risk, we have to minimize the conditional
risk, and the lower bound of this minimization is the Bayes risk.

Minimizing R(αi|x) for an n = 2 class problem:
Two actions α1 and α2 are to be taken for ω1 and ω2, respectively.
The decision rule is formulated as: decide action α1, or class ω1, if

R(α1|x) < R(α2|x), i.e. (λ21 − λ11)·P(ω1|x) > (λ12 − λ22)·P(ω2|x) -----------(1)


## Estimation and Minimization of Risk

Alternatively, under the assumption λ21 > λ11, i.e. the cost of an
incorrect decision is higher than that of a correct one, we can write the
decision rule as: decide ω1 if

p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)] ------------(2)

Eq. 2 is in the form of a likelihood ratio compared against the ratio of prior
probabilities P(ω2)/P(ω1), weighted by the λ terms.

Thus, the Bayes decision rule can be interpreted as calling for
deciding ω1 if the likelihood ratio exceeds a threshold value that
is independent of the observation x.

## Estimation and Minimization of Risk

Effect of λij in the case of n = 2 classes:

Suppose P(ω1) = P(ω2) = 0.5, and
λ11 = −2, λ22 = −1 : for correct classification
λ12 = 2, λ21 = 4 : for incorrect classification

Then the threshold in eq. 2 becomes:

[(λ12 − λ22) / (λ21 − λ11)] · [P(ω2)/P(ω1)] = (2 + 1)/(4 + 2) · 1 = 0.5

i.e. decide ω1 if p(x|ω1)/p(x|ω2) > 0.5. This shows that we have a significant
concern with correctly identifying class ω1, i.e. the signal, rather than the noise.
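The threshold computation for this worked example, using the slide's loss values and equal priors:

```python
# Likelihood-ratio threshold theta = (l12 - l22)/(l21 - l11) * P(w2)/P(w1),
# with the loss values from the worked example above.
l11, l22 = -2.0, -1.0   # rewards for correct classification
l12, l21 = 2.0, 4.0     # penalties for incorrect classification
p1 = p2 = 0.5           # equal priors

theta = (l12 - l22) / (l21 - l11) * (p2 / p1)
print(theta)   # -> 0.5
```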


## Estimation and Minimization of Risk: Simplest case

Often we choose:
λ11 = λ22 = 0, i.e. no cost or risk for correct classification, and
λ12 = λ21 = 1, i.e. unit cost or risk for incorrect classification.
In this case equation 2 becomes: decide ω1 if

p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1), i.e. if p(x|ω1)·P(ω1) > p(x|ω2)·P(ω2)

This is the same rule we established earlier for the
minimization of the probability of error.

## Estimation and Minimization of Risk: Generalization

For an n-class problem, to set the cost for correct and incorrect
classification we can define the cost/loss function as a zero-one loss
function:

λ(αi|ωj) = 0 if i = j, 1 if i ≠ j,   for i, j = 1, ..., n

The risk corresponding to this loss function is precisely the average
probability of error, since the conditional risk for decision αi is now
given as:

R(αi|x) = Σj≠i P(ωj|x) = 1 − P(ωi|x)
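A minimal sketch of the zero-one-loss identity: under R(αi|x) = 1 − P(ωi|x), the minimum-risk action coincides with the maximum-posterior class. The posterior values are arbitrary illustrative numbers:

```python
# Zero-one loss: R(a_i|x) = 1 - P(w_i|x), so minimizing the conditional
# risk is the same as maximizing the posterior probability.
posteriors = [0.2, 0.5, 0.3]          # assumed P(w_i|x), summing to 1
risks = [1.0 - p for p in posteriors]
print(risks.index(min(risks)) == posteriors.index(max(posteriors)))  # True
```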


Thus, to minimize the average probability of error, we
should select the αi that maximizes the posterior probability
P(ωi|x).

Thanks
