Ch2 Stat Approach PDF

2/21/2017
Statistical Approach to PR
Various PR approaches
Simple probabilistic approach

A fundamental statistical approach to the problem of pattern classification.
The decision problem is posed in probabilistic terms.
Ideal case: the probability structure underlying the categories is known perfectly.

Specific example: Fish sorting

Sea bass and salmon appear randomly on a belt. Let c denote the state of nature, or class:
c = ω1 for sea bass, c = ω2 for salmon
The state of nature is unpredictable, so it must be described probabilistically by an a priori (prior) probability. Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.
A priori probabilities may be known in advance, e.g.

1) based on collected samples used as training data,
2) based on the time of the year,
3) based on location.
Let
P(ω1) be the prior probability that the fish is a sea bass,
P(ω2) be the prior probability that the fish is a salmon.
Also, assuming no other fish can appear,
P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
Discrimination function
Case 1: We must make a decision without seeing the fish, i.e. with no features.
The prior probabilities of classes ω1 and ω2 are estimated from the available training data: if N is the total number of training patterns and N1, N2 of them belong to ω1 and ω2 respectively, then e.g.
P(ω1) = N1/N = 0.9, P(ω2) = N2/N = 0.1
Discriminating function:
if P(ω1) > P(ω2): choose ω1, otherwise choose ω2
P(error) = P(choose ω2 | ω1) P(ω1) + P(choose ω1 | ω2) P(ω2) = min[P(ω1), P(ω2)]
e.g. if we always choose ω1:
P(error) = 0 * 0.9 + 1 * 0.1 = 0.1
i.e. the probability of error is 10%, which is the minimum achievable without features.
So the rule works well if P(ω1) >> P(ω2), and not well at all if P(ω1) = P(ω2), i.e. if the classes are equiprobable (uniform priors), where the error is 50%.
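The prior-only rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `prior_only_decision` is hypothetical, and the priors are the 0.9 / 0.1 values from the example above.

```python
def prior_only_decision(priors):
    """Pick the class with the largest prior; return (class index, P(error))."""
    best = max(range(len(priors)), key=lambda i: priors[i])
    # Always choosing `best` is wrong exactly when some other class occurs.
    p_error = 1.0 - priors[best]
    return best, p_error


# Index 0 stands for sea bass, index 1 for salmon.
choice, p_error = prior_only_decision([0.9, 0.1])
print(choice, round(p_error, 3))
```

With uniform priors the same function returns an error of 0.5, which is why the rule is of little use when the classes are equiprobable.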
Bayesian Decision Theory

A classifier designed with Bayesian decision theory integrates all the available information about the problem, such as measurements, prior probabilities, likelihoods and evidence, to form the decision rules.
Decision rules can be formed in two ways:

a) By calculating the posterior probability P(ωi|X) from the prior probability P(ωi) and the likelihood p(X|ωi), using Bayes' theorem.
b) By formulating a measure of expected classification error or risk and choosing the decision rule that minimizes this measure.
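Bayes' Theorem
To derive Bayes' theorem, start from the definition of conditional probability. The probability of the event A given the event B is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Equivalently, the probability of the event B given the event A is
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
Rearranging and combining these two equations, we find
$$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A)$$
Discarding the middle term and dividing both sides by P(B), provided that neither P(B) nor P(A) is 0, we obtain Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$
Bayes' theorem is often completed by noting that, according to the Law of Total Probability,
$$P(B) = P(B \mid A)\,P(A) + P(B \mid A^{c})\,P(A^{c}) \qquad (2)$$
where A^c is the complementary event of A.

Substituting (2) into (1):
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^{c})\,P(A^{c})}$$
More generally, the Law of Total Probability states that, given a partition {Ai} of the event space,
$$P(B) = \sum_{j} P(B \mid A_j)\,P(A_j)$$
Therefore, for any Ai in the partition,
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j} P(B \mid A_j)\,P(A_j)} \qquad (3)$$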
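Bayes' theorem over a partition can be checked numerically with a short sketch. The function name `posterior` and the numbers below are illustrative assumptions, not values from the slides.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every event A_i of a partition.

    priors[i] is P(A_i) and likelihoods[i] is P(B | A_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # law of total probability: P(B)
    if evidence == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return [j / evidence for j in joint]


# Hypothetical values: P(A1) = 0.9, P(A2) = 0.1, P(B|A1) = 0.2, P(B|A2) = 0.6
post = posterior([0.9, 0.1], [0.2, 0.6])
print(post)
```

The denominator is computed once and shared by all events, which is why the returned values always sum to 1.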
Pattern and Class Representation

A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector
x = (x1, x2, ..., xd)^T
Given a pattern x = (x1, x2, ..., xd)^T, assign it to one of n classes in the set C:
C = {ω1, ω2, ..., ωn}
The states of nature: given

P(ωi): the class probabilities (prior probabilities, estimated from training data)
p(x|ωi): the class-conditional probabilities (likelihoods, estimated from training data)
[Figure: three classes ω1, ω2 and ω3, each with its prior P(ωi) and class-conditional density p(x|ωi), and the observed pattern x.]
Posterior Probability
P(ωi|X): the posterior probability that a test pattern X belongs to class ωi is given by
$$P(\omega_i \mid X) = \frac{p(X \mid \omega_i)\,P(\omega_i)}{p(X)}, \qquad \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$
To classify a test pattern with attribute vector X, we assign it to the class that is most probable given X.
That is, we estimate P(ωi|X) for each class i = 1 to n and assign the pattern to the class with the highest posterior probability.
Bayes classifier: maximizing P(ωi|X) means maximizing the right-hand side of the equation above. All of its terms can be calculated from training data:
P(ωi): (number of instances of the class) / (total number of samples)
p(X): constant for all the classes, so a single value is needed. For a single feature x and multiple classes it is calculated as
$$p(X = x) = \sum_{i=1}^{n} p(x \mid \omega_i)\,P(\omega_i)$$
p(X|ωi): considering all the attributes jointly needs heavy computation, so assume that the attributes are independent of each other given the class. This makes the classifier a Naive Bayes classifier, i.e.
$$p(X \mid \omega_i) = \prod_{k=1}^{d} p(x_k \mid \omega_i)$$
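A Naive Bayes classifier built from counts can be sketched as follows. This is a minimal illustration under stated assumptions: the features are categorical, no smoothing is applied, and the tiny data set and the names `train_naive_bayes` and `classify` are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Estimate P(class) and the counts behind P(x_k = value | class)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[c][k][value] = number of class-c samples whose k-th feature is `value`
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(samples, labels):
        for k, value in enumerate(x):
            cond[c][k][value] += 1
    return priors, cond, class_counts


def classify(x, priors, cond, class_counts):
    """Return the class maximizing P(class) * prod_k P(x_k | class)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            score *= cond[c][k][value] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical training patterns: (length, lightness)
samples = [("long", "light"), ("long", "dark"), ("long", "light"),
           ("short", "light"), ("short", "dark"), ("short", "dark")]
labels = ["bass", "bass", "bass", "bass", "salmon", "salmon"]
model = train_naive_bayes(samples, labels)
print(classify(("short", "dark"), *model)[0])
```

The evidence p(X) is never computed because it is the same for every class and does not change which score is largest.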
Bayes Classifier for fish sorting

The simple probabilistic approach is not sufficient to classify the fish.
To improve classification correctness, we use features that can be measured on the fish and incorporate the Bayes classification concept.
Assume the fish (sea bass/salmon) is described by a scalar X comprising a single feature:
x = length (a continuous random variable)
Define p(x|ωi) as the class-conditional probability density (the density of x given the state of nature ωi). Its distribution depends on the state of nature (class) ωi and on the feature value.
These probability density functions can be obtained by observing a large number of training pattern samples (statistically).

p(x|ω1) and p(x|ω2) are pdf curves that describe the difference between the populations of the two classes, e.g. sea bass and salmon.
Fig: Hypothetical class-conditional pdfs show the probability density of measuring a particular feature value x given that the pattern is in category ωi. Density functions are normalized, so the area under each curve is 1.0.

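Suppose we know the priors P(ωi) and the densities p(x|ωi), and we measure the length of a fish as the value x.
The a posteriori probability P(ωi|x) (the probability of ωi given the measured feature value x) is then given by Bayes' theorem:
$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}, \qquad \text{where } p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$
$$P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1$$
Decision rule:
$$c = \begin{cases} \omega_1 & \text{if } P(\omega_1 \mid x) > P(\omega_2 \mid x) \\ \omega_2 & \text{otherwise} \end{cases} \quad\Longleftrightarrow\quad c = \begin{cases} \omega_1 & \text{if } p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2) \\ \omega_2 & \text{otherwise} \end{cases}$$
The maximization of P(ωi|x) depends only on the product p(x|ωi) P(ωi), because the evidence p(x) is the same for all classes. If the classes are also equiprobable, i.e. P(ω1) = P(ω2) = 0.5, then the prior can be dropped from the comparison as well and the decision depends on the likelihood p(x|ωi) alone.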
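The two-class decision rule for a single measured feature can be sketched as below. The Gaussian shape of the densities, their parameters (in cm) and the names `gaussian_pdf` and `bayes_decide` are assumptions made only for illustration; any normalized density could be plugged in.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate normal density, used here only as a stand-in for p(x | class)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_decide(x, densities, priors):
    """Return (index of the most probable class, list of posteriors) for scalar x."""
    joint = [pdf(x) * prior for pdf, prior in zip(densities, priors)]
    evidence = sum(joint)  # p(x), identical for every class
    posteriors = [j / evidence for j in joint]
    return posteriors.index(max(posteriors)), posteriors


# Hypothetical length models: class 0 = sea bass, class 1 = salmon.
densities = [lambda x: gaussian_pdf(x, 60.0, 8.0),
             lambda x: gaussian_pdf(x, 45.0, 6.0)]
priors = [2.0 / 3.0, 1.0 / 3.0]
label, post = bayes_decide(48.0, densities, priors)
print(label, post)
```

Dividing by the evidence is only needed to report proper posteriors; the winning class is already determined by the largest joint term.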
Posterior probability for particular priors
Fig (a): Hypothetical class-conditional pdfs show the probability density of measuring a particular feature value x given that the pattern is in category ωi.
Fig (b): Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3, for the densities in Fig (a).
Given that a pattern is measured to have feature value x = 14, the probability that it belongs to class ω2 is 0.08 and the probability that it belongs to class ω1 is 0.92.
Probability of Error
Remember that the goal is to minimize error.
Whenever we observe a particular x, the probability of error is:
P(error|x) = P(ω1|x) if we decide ω2 in place of ω1
P(error|x) = P(ω2|x) if we decide ω1 in place of ω2
For any given x, we can minimize the error by choosing the class with the larger of P(ω1|x) and P(ω2|x), i.e.
decide class ω1 if
P(ω1|x) > P(ω2|x)
or, equivalently, if
p(x|ω1) P(ω1) > p(x|ω2) P(ω2)
and decide ω2 otherwise. Under this rule,
P(error|x) = min[P(ω1|x), P(ω2|x)]
Deciding the decision boundary
The decision boundary xB is the border between classes ωi and ωj, i.e. the point where
P(ωi|x) = P(ωj|x)
Fig: Components of the error for equal priors and a non-optimal decision point x*. The complete pink area (including the triangular area) corresponds to the probability of error for deciding ω1 when the class is ω2, and the gray area corresponds to the converse.
If we select xB as the decision boundary in place of x*, we eliminate the reducible error portion and minimize the error.
Generalization:
Allowing more than one feature:
replaces the scalar x with a vector x from a d-dimensional feature space R^d
Allowing more than two classes:

decide ωi if P(ωi|X) > P(ωj|X) for all j ≠ i
Allowing actions other than classification:

e.g. allows refusing to classify when the class is not known
Introducing a loss function more general than the probability of error:

weighting decision costs
Allowing generalization on the probability density function:

1. Gaussian (normal) density function
2. Uniform density function
Modeling the conditional density

Bayes classifier behavior is determined by
1) the conditional density p(x|ωi)
2) the a priori probability P(ωi)
If we assume that our training data follow some standard probability distribution function, then estimating p(x|ωi) for a particular value of x is easy, i.e. there is no need to find the actual pdf first; only the parameters of the assumed distribution are estimated.
The normal (Gaussian) probability density function is a very common assumption for the pdf, because:
Many natural phenomena are well approximated by it.
It is well behaved and analytically tractable.
It is an appropriate model for continuous values randomly distributed around a mean μ.
It is suitable for multivariate modeling.
When the class-conditional densities really are Gaussian, the resulting Bayes classifier is optimal and its discriminant functions have a simple closed form.
Univariate Gaussian (or Normal) function: N(μ, σ²)

A univariate, or one-dimensional (d = 1), Gaussian function is defined as:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The pdf has roughly 95% of its area in the range |x - μ| ≤ 2σ, and its peak has the value p(μ) = 1/(√(2π) σ).
Fig: (a) Mean value μ = 0 and variance σ² = 1; (b) mean value μ = 1 and variance σ² = 0.2.
The larger the variance, the broader the graph.
The graphs are symmetric and centered at their respective mean values.
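The peak value and the "roughly 95% within two standard deviations" property of the univariate Gaussian can be checked with the standard library. The helper names `normal_pdf` and `mass_within` are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def mass_within(k):
    """Probability that |x - mu| <= k * sigma, valid for any normal distribution."""
    return math.erf(k / math.sqrt(2.0))


peak = normal_pdf(0.0, 0.0, 1.0)   # equals 1 / sqrt(2 * pi) for sigma = 1
two_sigma = mass_within(2.0)       # close to 0.95
print(peak, two_sigma)
```

A smaller standard deviation raises the peak, since the area under the curve must stay equal to 1.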
Multivariate Gaussian (or Normal) function: N(μ, Σ)

A multivariate, or d-dimensional, Gaussian function is defined as:
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
Case 1: Single feature and two classes (equiprobable) problem with Gaussian pdf:
Number of features: d = 1 (x). Number of classes: C = 2
Prior probabilities: P(ω1) = P(ω2) = 0.5 # Same for both classes for simplicity
Class variances: σ1 = σ2 = σ # Same for both classes
Class means: μ1 ≠ μ2 # Different for the two classes
Posterior probability:
$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$
p(x) is not taken into account, because it is the same for all classes and does not affect the decision. Furthermore, if the a priori probabilities are equal, they also do not affect the comparison of the two posterior values.
So, decision rule: max[p(x|ω1), p(x|ω2)] # Also called the maximum likelihood rule
i.e. the search for the maximum now rests on the values of the conditional pdfs evaluated at x.
Case 1: Single feature and two classes problem with Gaussian pdf:
The line at x0 is a threshold partitioning the feature space into two regions, R1 and R2; it is the decision boundary.
The total probability of committing a decision error for the case of two equiprobable classes, taking R1 as the region x < x0, is given by:
$$P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$
Case 1: Single feature and two classes (not equiprobable) problem with Gaussian pdf:
The decision boundary depends on the prior probabilities, because they are part of the posterior probability.
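For two 1-D Gaussian classes with a common variance, the threshold and the total error can be computed in closed form. The sketch below assumes μ1 < μ2, so that the region x < x0 is assigned to class ω1; the function names and the numeric values are illustrative assumptions.

```python
import math


def std_normal_cdf(z):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def threshold(mu1, mu2, sigma, p1, p2):
    """Point x0 where p(x|w1) P(w1) = p(x|w2) P(w2), for equal variances."""
    return 0.5 * (mu1 + mu2) - sigma ** 2 / (mu1 - mu2) * math.log(p1 / p2)


def total_error(mu1, mu2, sigma, p1, p2):
    """Return (x0, P_e) when x < x0 is assigned to class 1 (needs mu1 < mu2)."""
    x0 = threshold(mu1, mu2, sigma, p1, p2)
    class2_in_r1 = p2 * std_normal_cdf((x0 - mu2) / sigma)
    class1_in_r2 = p1 * (1.0 - std_normal_cdf((x0 - mu1) / sigma))
    return x0, class2_in_r1 + class1_in_r2


print(total_error(0.0, 2.0, 1.0, 0.5, 0.5))   # equal priors: x0 is the midpoint
print(threshold(0.0, 2.0, 1.0, 0.8, 0.2))     # unequal priors move x0
```

With unequal priors the threshold moves toward the mean of the less likely class, which enlarges the region given to the more likely one.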
Case 1: Two features and two classes problem with Gaussian pdf:
To improve classification correctness further, we add one more feature to our fish sorting.
Assume the fish (sea bass/salmon) is described by a feature vector X comprising two features:
x1 = length
x2 = lightness intensity
Define p(x1, x2|ωi) as the class-conditional probability density function for both classes.
These probability density functions can be obtained by observing a large number of training pattern samples (statistically).
By extending Bayes' theorem to multi-feature and multi-class problems, we can design the decision rule for our fish sorting problem.
Fig: The probability density functions in 2D (two features) for classes ω1 and ω2. Since we have included two features, the decision boundaries are clearer, which reduces the error.
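Estimation of Discriminating function for multivariate Gaussian pdf
The general multivariate Gaussian pdf is defined as:
$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
With class-specific mean vectors μi and class-specific covariance matrices Σi, the class-dependent Gaussian density functions p(x|ωi) are given by:
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right) \qquad (1)$$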
For classification, the discriminant function for the i-th class can be written in terms of posterior probability maximization; x is assigned to the class with the largest gi(x):
gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / p(x)
Since p(x) is the same for all classes, we can drop it, so
gi(x) = p(x|ωi) P(ωi)
Because of the exponential form of the densities involved, and because the logarithm is monotonically increasing, log() can be used to obtain an equivalent discriminant function for the i-th class:
gi(x) = log(p(x|ωi) P(ωi))
= log(p(x|ωi)) + log(P(ωi)) ---------(2)
Discriminating function for multivariate Gaussian pdf

From equations 1 and 2 we find:
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) \qquad (3)$$
Dropping the second term, which is the same for every class, we get
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) \qquad (4)$$
Here Σi influences classification.

Now we have the following cases:
Case 1:
Σi = Σ = σ²I (I is the identity matrix; diagonal case), i.e. the covariance matrix is the same for all classes and proportional to I.
Case 2:
Σi = Σ, an arbitrary covariance matrix that is identical for all classes.
Case 3:
Σi is an arbitrary covariance matrix that is not identical for all classes.

Case 1:
Σi = Σ = σ²I (I is the identity matrix; diagonal case), i.e. the covariance matrix is the same for all classes and proportional to I.
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) \qquad (4)$$
Assuming equal covariance matrices (i.e. the classes differ only through the mean vectors μi), the second term is a class-independent constant bias and can be dropped, so
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i) + \log P(\omega_i) \qquad (5)$$
Σ still influences classification through the first term, which is the squared distance of the feature vector x from the i-th mean vector μi, weighted by the inverse of the covariance matrix Σ^{-1}.

Assume the features are statistically independent and each feature has the same variance σ². Substituting the following into eq. 5 gives eq. 6, the simple discriminant function:
$$\Sigma^{-1} = \frac{1}{\sigma^{2}}\,I, \qquad (x-\mu_i)^{T}(x-\mu_i) = \|x-\mu_i\|^{2}$$
$$g_i(x) = -\frac{\|x-\mu_i\|^{2}}{2\sigma^{2}} + \log P(\omega_i) \qquad (6)$$
If P(ωi) is the same for all c classes, then the log P(ωi) term becomes another unimportant additive constant that can be ignored:
$$g_i(x) = -\frac{\|x-\mu_i\|^{2}}{2\sigma^{2}}$$
If σ = 1 then Σ = I, and the distance is the plain Euclidean distance, so the Euclidean distance can be taken as the discriminating factor: gi(x) is largest when the distance is smallest.
The optimal decision rule:
To classify a feature vector x, measure the squared Euclidean distance ||x - μi||² between x and each of the mean vectors, and assign x to the category of the nearest mean.
Such a classifier is called a minimum-distance classifier.
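The minimum-distance classifier needs only the class means. The sketch below assumes equal priors and a covariance matrix proportional to the identity; the mean vectors and the function names are hypothetical.

```python
def squared_distance(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def minimum_distance_classify(x, means):
    """Return the index of the class mean nearest to x."""
    return min(range(len(means)), key=lambda i: squared_distance(x, means[i]))


# Hypothetical class means in a two-feature space (length, lightness).
means = [(60.0, 3.0), (45.0, 7.0)]
print(minimum_distance_classify((50.0, 6.0), means))
```

Because σ² only scales every distance by the same factor, it does not appear in the code.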
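Discriminant function as Linear discriminant function:

Proof
The Bayesian discriminant function is given as:
$$g_i(x) = -\frac{\|x-\mu_i\|^{2}}{2\sigma^{2}}$$
Since
$$\|x-\mu_i\|^{2} = x^{T}x - 2\mu_i^{T}x + \mu_i^{T}\mu_i$$
then
$$g_i(x) = -\frac{x^{T}x - 2\mu_i^{T}x + \mu_i^{T}\mu_i}{2\sigma^{2}}$$
The quadratic term x^T x is the same for all i, making it an ignorable additive constant. Thus we obtain the equivalent linear discriminant function
$$g_i(x) = w_i^{T}x + w_{i0}$$
where
$$w_i = \frac{1}{\sigma^{2}}\,\mu_i \quad \text{and} \quad w_{i0} = -\frac{1}{2\sigma^{2}}\,\mu_i^{T}\mu_i$$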
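That the linear form ranks the classes exactly like the distance form can be verified numerically: the two differ only by x^T x / (2σ²), which does not depend on the class. The function names and test vectors below are assumptions for illustration.

```python
def g_distance(x, mu, sigma2):
    """Distance form: -||x - mu||^2 / (2 sigma^2)."""
    return -sum((xi - m) ** 2 for xi, m in zip(x, mu)) / (2.0 * sigma2)


def linear_params(mu, sigma2):
    """Weights of the linear form: w = mu / sigma^2, w0 = -mu.mu / (2 sigma^2)."""
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2.0 * sigma2)
    return w, w0


def g_linear(x, mu, sigma2):
    """Linear form: w^T x + w0."""
    w, w0 = linear_params(mu, sigma2)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


x, sigma2 = (1.0, 2.0), 0.5
for mu in [(0.0, 0.0), (3.0, -1.0)]:
    # The gap equals x.x / (2 sigma^2) for every class mean.
    print(g_linear(x, mu, sigma2) - g_distance(x, mu, sigma2))
```

Since the gap is identical for every class, dropping the quadratic term never changes which gi(x) is largest.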
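Decision boundaries for the case Σi = Σ = σ²I

The decision surfaces for a linear classifier are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities.
If x0 is the threshold, then gi(x0) - gj(x0) = 0.
Since
$$g_i(x) = -\frac{\|x-\mu_i\|^{2}}{2\sigma^{2}} + \log P(\omega_i)$$
the boundary can be written as
$$w^{T}(x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad x_0 = \tfrac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^{2}}{\|\mu_i-\mu_j\|^{2}}\,\log\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$
If P(ωi) = P(ωj), then the point x0 is halfway between the means, and the hyperplane is the perpendicular bisector of the line between the means.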
Decision boundaries for Σi = σ²I and P(ω1) = P(ω2)
Fig: 1-D, 2-D and 3-D cases.
If the covariances of two distributions are equal and proportional to the identity matrix, then the distributions are hyperspherical in d dimensions, and the boundary is a generalized hyperplane of d - 1 dimensions, perpendicular to the line joining the means.
In the 3-dimensional case, the grid plane separates R1 from R2.
Decision boundaries for Σi = σ²I and P(ω1) ≠ P(ω2)

If P(ωi) ≠ P(ωj), then the point x0 shifts away from the more likely mean.
If the variance σ² is small relative to the squared distance ||μi - μj||², then the position of the decision boundary x0 is relatively insensitive to the exact values of the prior probabilities P(ω).
Fig: 1-D, 2-D and 3-D cases.

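Case 2: Σi = Σ, an arbitrary covariance matrix that is identical for each class
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i) \qquad (4)$$
Because the covariance matrix is identical for all classes, we can drop the second term and get:
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i) + \log P(\omega_i)$$
If the prior probabilities P(ωi) are the same for all classes, then the log P(ωi) term can be ignored.
In this case the covariance matrix affects the distance measurement, so the distance is no longer the simple Euclidean distance but the Mahalanobis distance from the feature vector x to μi, given by:
$$r^{2} = (x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i)$$
Patterns can then be classified on the basis of the Mahalanobis distance: x is assigned to the class whose mean is nearest in this metric.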
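Classification by Mahalanobis distance can be sketched for two features without external libraries. The 2x2 covariance matrix, the means and the function names are hypothetical; equal priors are assumed.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance (x - mu)^T cov_inv (x - mu) in 2-D."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    t0 = cov_inv[0][0] * d0 + cov_inv[0][1] * d1
    t1 = cov_inv[1][0] * d0 + cov_inv[1][1] * d1
    return d0 * t0 + d1 * t1


def classify(x, means, cov):
    """Assign x to the class whose mean is nearest in Mahalanobis distance."""
    cov_inv = inverse_2x2(cov)
    return min(range(len(means)), key=lambda i: mahalanobis_sq(x, means[i], cov_inv))


means = [(0.0, 0.0), (2.0, 2.0)]
shared_cov = [[4.0, 0.0], [0.0, 0.25]]   # large spread along x1, small along x2
print(classify((2.0, 0.5), means, shared_cov))
```

With the identity matrix as covariance the same point is assigned to the other class, which shows how the shared covariance reshapes the notion of "nearest".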
Decision boundaries for the case Σi = Σ (arbitrary but identical for all classes)
Fig: 2-D and 3-D cases.

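Case 3: Σi is an arbitrary covariance matrix, not identical for each class but invertible.
In the general multivariate normal case, the covariance matrices are different for each category. The discriminating function is:
$$g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$
Nothing can be dropped from the above equation now.
Also, because of the arbitrary, class-specific covariance matrix in the first term, the resulting discriminant functions are inherently quadratic. By expanding the first term we get:
$$g_i(x) = x^{T}W_i\,x + w_i^{T}x + w_{i0}$$
$$W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\mu_i^{T}\Sigma_i^{-1}\mu_i - \tfrac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$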
Decision boundaries
In the two-category case, the decision surfaces are hyperquadrics, which can take any of the following general forms:
1) Hyperplanes,
2) Pairs of hyperplanes,
3) Hyperspheres,
4) Hyperellipsoids,
5) Hyperparaboloids, and
6) Hyperhyperboloids
Even in 1-D, the decision regions need not be simply connected.
Fig (1-D case): Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance.
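The remark about non-simply connected regions in 1-D can be reproduced with the quadratic discriminant specialized to one feature, where -½ log|Σi| becomes -log σi. The class parameters and the function names are hypothetical.

```python
import math


def g(x, mu, sigma, prior):
    """1-D quadratic discriminant for a Gaussian class N(mu, sigma^2)."""
    return -((x - mu) ** 2) / (2.0 * sigma ** 2) - math.log(sigma) + math.log(prior)


def decide(x, classes):
    """Return the index of the class with the largest discriminant at x."""
    return max(range(len(classes)), key=lambda i: g(x, *classes[i]))


# Two classes with the same mean but unequal standard deviations, equal priors.
classes = [(0.0, 1.0, 0.5), (0.0, 3.0, 0.5)]
labels = [decide(x, classes) for x in (-4.0, 0.0, 4.0)]
print(labels)
```

The narrow class wins near the common mean and the wide class wins in both tails, so the region of the wide class consists of two disjoint intervals.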
Fig (2-D case): Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.
Fig (3-D case): Arbitrary three-dimensional Gaussian distributions yield Bayes decision boundaries that are two-dimensional hyperquadrics.
Fig (2 features and 4 classes): The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex.
Introducing a loss function more general than probability of error
In classification problems we face the following situations:
In simple cases, all classification mistakes are equally costly.
In other cases, classification mistakes are not equally costly, i.e. some mistakes are costlier than others. For example:
In the RADAR signal analysis problem, the two classes correspond to a valid signal (e.g. a message or perhaps a radar return) and noise. Here, we attempt to develop strategies so that the incoming signal is correctly classified.
RADAR signal analysis problem

Case 1: The incoming signal is a valid signal and we correctly classify it as such (a hit): ω1 -> ω1
Case 2: The incoming signal is a valid signal and we incorrectly classify it as noise (a miss): ω1 -> ω2
Case 3: The incoming signal is noise and we incorrectly classify it as a valid signal (a false alarm): ω2 -> ω1
Case 4: The incoming signal is noise and we correctly classify it as noise (a true reject): ω2 -> ω2

Suppose that after classification we need to take some action, such as firing a missile or not.
In this case, we can assume that the action associated with a wrong classification is costlier than the action associated with a right classification.
Let α denote the action or decision taken in the above four cases, e.g.
α1 = fire a missile if signal (ω1)
α2 = do not fire a missile if noise (ω2)
So this example leads us to define the probability of error in a new manner.

Let
X be a d-dimensional feature vector and {ω1, ω2, ..., ωn} the finite set of n classes,
{α1, α2, ..., αn} the finite set of n possible actions, one associated with each class, where αi is the decision to choose class ωi.
A decision rule is a function α(x) that maps x to an action, e.g. α(x) -> αi.
λ(αi|ωj) is a loss or cost function: the loss or cost resulting from taking action αi when the class is ωj.
λij is shorthand for λ(αi|ωj), the cost of selecting action αi when the class is ωj.
E.g. for c = 2 (number of classes),
λ11 and λ22 are the rewards for correct classification and
λ12 and λ21 are the penalties for incorrect classification.
Risk R: in decision-theoretic terminology, the expected loss is called the risk; R denotes the overall risk.
R(αi|x): the conditional risk, which depends on the action selected.
Estimation and Minimization of Risk

For an n-class problem, let X be classified as belonging to class ωk. Then the probability of error according to the Bayes rule is
$$P(\text{error} \mid X) = \sum_{i=1,\; i \neq k}^{n} P(\omega_i \mid X)$$
This formulation of the error does not describe the risk. For the risk:
The probability that we select action αi when ωj is the true class is
P(αi, ωj) = P(αi|ωj) P(ωj)
The term P(αi|ωj) depends on the chosen mapping α(x) -> αi, which in turn depends on x. The conditional risk is then given as:
$$R(\alpha_i \mid x) = \sum_{j=1}^{n} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j=1}^{n} \lambda_{ij}\,P(\omega_j \mid x)$$

If c = 2 and the number of actions is 2, then the overall risk measure is:
R = λ11 P(α1|ω1) P(ω1) + λ12 P(α1|ω2) P(ω2)
+ λ21 P(α2|ω1) P(ω1) + λ22 P(α2|ω2) P(ω2)
where
for action α1, the measure of conditional risk is
R[α(x) -> α1] = R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
and for action α2, the measure of conditional risk is
R[α(x) -> α2] = R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
For n classes, the expected risk R[α(x)] of classifying x (using the total probability theorem) is:
$$R = \int R(\alpha(x) \mid x)\,p(x)\,dx$$
To minimize the overall risk, we have to minimize the conditional risk for every x. The lower bound reached by this minimization is the Bayes risk.
Minimizing R(αi|x) for the n = 2 class problem:

Two actions α1 and α2 are to be taken for ω1 and ω2 respectively.
The decision rule is formulated as: decide action α1 (class ω1) if:
In terms of risk comparison:
$$R(\alpha_1 \mid x) < R(\alpha_2 \mid x), \quad \text{i.e.} \quad (\lambda_{21}-\lambda_{11})\,P(\omega_1 \mid x) > (\lambda_{12}-\lambda_{22})\,P(\omega_2 \mid x)$$
In terms of prior probabilities:
$$(\lambda_{21}-\lambda_{11})\,p(x \mid \omega_1)\,P(\omega_1) > (\lambda_{12}-\lambda_{22})\,p(x \mid \omega_2)\,P(\omega_2) \qquad (1)$$
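The conditional risk and the minimum-risk action can be computed directly from a loss matrix and the posteriors. In the sketch below, `loss[i][j]` stands for λij; the loss values follow the two-class radar example of this chapter, while the posteriors and the function names are hypothetical.

```python
def conditional_risks(loss, posteriors):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), one value per action."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]


def min_risk_action(loss, posteriors):
    """Return (index of the action with the smallest conditional risk, all risks)."""
    risks = conditional_risks(loss, posteriors)
    return risks.index(min(risks)), risks


radar_loss = [[-2.0, 2.0],    # action 1 (decide signal): hit, false alarm
              [4.0, -1.0]]    # action 2 (decide noise): miss, true reject
zero_one_loss = [[0.0, 1.0], [1.0, 0.0]]
posteriors = [0.4, 0.6]       # hypothetical P(signal | x), P(noise | x)

print(min_risk_action(radar_loss, posteriors))
print(min_risk_action(zero_one_loss, posteriors))
```

With the radar costs the cheaper action is to decide "signal" even though noise is more probable, whereas the zero-one loss simply follows the larger posterior.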

So, alternatively, under the assumption λ21 > λ11, i.e. the cost of an incorrect decision is higher than that of the correct one, we can write the decision rule as: decide ω1 if
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \qquad (2)$$
Eq. 2 compares the likelihood ratio p(x|ω1)/p(x|ω2) with the ratio of prior probabilities P(ω2)/P(ω1) weighted by the cost ratio (λ12 - λ22)/(λ21 - λ11).
Thus, the Bayes decision rule can be interpreted as calling for deciding ω1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.

Effect of λij in the case of n = 2 classes
Suppose P(ω1) = P(ω2) = 0.5, and

λ11 = -2, λ22 = -1: for correct decisions
λ12 = 2, λ21 = 4: for incorrect decisions
A miss (λ21) is then twice as costly as a false alarm (λ12), so equation 2 becomes: decide ω1 if
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{2-(-1)}{4-(-2)} \cdot \frac{0.5}{0.5} = \frac{1}{2}$$
This shows that we are more concerned with correctly identifying class ω1, i.e. the signal, than the noise.
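The likelihood-ratio threshold of the two-class minimum-risk rule can be computed with a small helper. The function name `likelihood_ratio_threshold` is hypothetical; the loss values are the ones used in the example above.

```python
def likelihood_ratio_threshold(loss, p1, p2):
    """Threshold on p(x|w1) / p(x|w2) above which class 1 is decided.

    loss is [[l11, l12], [l21, l22]], where l_ij is the cost of action i
    when the true class is j; the rule assumes l21 > l11.
    """
    (l11, l12), (l21, l22) = loss
    if l21 <= l11:
        raise ValueError("the rule assumes l21 > l11")
    return (l12 - l22) / (l21 - l11) * (p2 / p1)


print(likelihood_ratio_threshold([[-2.0, 2.0], [4.0, -1.0]], 0.5, 0.5))
print(likelihood_ratio_threshold([[0.0, 1.0], [1.0, 0.0]], 0.9, 0.1))
```

The threshold depends only on the costs and the priors, so it can be computed once and reused for every observation x.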
Estimation and Minimization of Risk: Simplest case

Often we choose:
λ11 = λ22 = 0, i.e. no cost or risk for correct classification
λ12 = λ21 = 1, i.e. unit cost or risk for incorrect classification
So, in this case equation 2 becomes: decide ω1 if
$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}, \quad \text{i.e.} \quad p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$$
This is the same rule as we established earlier for minimizing the probability of error.
Estimation and Minimization of Risk: Generalization

For an n-class problem, the cost of correct and incorrect classification can be defined by the zero-one loss function:
$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \dots, n$$
It treats all errors as equally costly.
The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk for decision αi is now:
$$R(\alpha_i \mid x) = \sum_{j=1}^{n} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$
Thus, to minimize the average probability of error, we should select the i that maximizes the posterior probability P(ωi|x).
In other words, for minimum error rate:
decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i
Thanks