Statistical Approach To Pattern Recognition

May 01, 2017

© All Rights Reserved


Statistical Approach to PR

Various PR approaches


2/21/2017

A fundamental statistical approach to the problem of pattern classification. It assumes that the probability structure underlying the categories is known perfectly.

Example: sea bass and salmon appear randomly on a conveyor belt. Let c denote the state of nature or class:

$c = \omega_1$ for sea bass, $c = \omega_2$ for salmon

The state of nature is unpredictable, so it must be described probabilistically by a prior (a priori) probability. Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it. They can be estimated:

1) from collected samples used as training data,

2) from the time of the year,

3) from the location.

Let

$P(\omega_1)$ be the prior probability that the fish is a sea bass

$P(\omega_2)$ be the prior probability that the fish is a salmon

Assuming no other fish can appear,

$P(\omega_1) + P(\omega_2) = 1$ (exclusivity and exhaustivity)


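Discriminant function

Case 1: We must make a decision without seeing the fish, i.e. with no features. The only information available is the priors, which can be estimated from training data: if $N$ is the total number of available training patterns, and $N_1$, $N_2$ of them belong to $\omega_1$ and $\omega_2$ respectively, then

$$P(\omega_1) \approx N_1 / N, \qquad P(\omega_2) \approx N_2 / N$$

Discriminant function:

if $P(\omega_1) > P(\omega_2)$: choose $\omega_1$, otherwise choose $\omega_2$

$$P(\text{error}) = P(\text{choose } \omega_2 \mid \omega_1)\,P(\omega_1) + P(\text{choose } \omega_1 \mid \omega_2)\,P(\omega_2) = \min[P(\omega_1), P(\omega_2)]$$

e.g. if $P(\omega_1) = 0.9$ and $P(\omega_2) = 0.1$, we always choose $\omega_1$, so

$$P(\text{error}) = 0 \times 0.9 + 1 \times 0.1 = 0.1$$

i.e. the probability of error is 10%, which is the minimal error achievable without features.

This rule works well when one prior is much larger than the other, but not well at all if $P(\omega_1) = P(\omega_2)$, i.e. if the classes are equiprobable (uniform priors).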
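The prior-only rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `decide_from_priors` is an assumption made for the example.

```python
def decide_from_priors(priors):
    """Pick the class with the largest prior; no features are observed.

    priors[i] is P(omega_{i+1}). Returns (chosen index, probability of error).
    """
    best = max(range(len(priors)), key=lambda i: priors[i])
    # Every pattern that belongs to any other class is misclassified.
    p_error = 1.0 - priors[best]
    return best, p_error


# Priors taken from the example in the text: P(w1) = 0.9, P(w2) = 0.1.
choice, p_error = decide_from_priors([0.9, 0.1])

# With uniform priors the rule is no better than a coin flip.
_, p_error_uniform = decide_from_priors([0.5, 0.5])
```

For two classes the returned error equals min[P(ω1), P(ω2)], which is why the rule degrades to 50% error for equiprobable classes.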

A classifier designed on Bayesian decision theory integrates all the available problem information, such as measurements, prior probabilities, likelihoods and evidence, to form the decision rules. The rules can be formed in two ways:

a) By calculating the posterior probability $P(\omega_i \mid X)$ from the prior probability $P(\omega_i)$ and the likelihood $p(X \mid \omega_i)$ (Bayes' theorem).

b) By formulating a measure of expected classification error or risk and choosing a decision rule that minimizes this measure.


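Bayes' Theorem

To derive Bayes' theorem, start from the definition of conditional probability. The probability of the event A given the event B is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Likewise, $P(B \mid A) = P(A \cap B) / P(A)$. Combining the two gives

$$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A)$$

Discarding the middle term and dividing both sides by P(B), provided that neither P(B) nor P(A) is 0, we obtain Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$

Bayes' theorem is often completed by noting that, according to the Law of Total Probability,

$$P(B) = P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c) \qquad (2)$$

Substituting (2) into (1):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$$

More generally, the Law of Total Probability states that, given a partition $\{A_i\}$ of the event space,

$$P(B) = \sum_i P(B \mid A_i)\,P(A_i) \qquad (3)$$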


A pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector

$$x = (x_1, x_2, \ldots, x_d)^T$$

Given a pattern $x = (x_1, x_2, \ldots, x_d)^T$, assign it to one of n classes in the set C:

$$C = \{\omega_1, \omega_2, \ldots, \omega_n\}$$

$P(\omega_i)$: class probabilities (prior probabilities, calculated from training data)

$P(x \mid \omega_i)$: class-conditional probabilities (likelihoods, calculated from training data)

[Figure: a pattern x and three classes $\omega_1$, $\omega_2$, $\omega_3$, each labeled with its prior $P(\omega_i)$ and class-conditional probability $P(x \mid \omega_i)$.]

Posterior Probability

$P(\omega_i \mid X)$: the posterior probability that a test pattern X belongs to class $\omega_i$ is given by:

$$P(\omega_i \mid X) = \frac{P(X \mid \omega_i)\,P(\omega_i)}{P(X)}$$

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

To classify a test pattern with attribute vector X, we assign it to the class most probable for X. This means we estimate $P(\omega_i \mid X)$ for each class i = 1 to n and assign the pattern to the class with the highest posterior probability. This is the Bayes classifier.

Maximizing the posterior means maximizing the right-hand side of the equation above. All its terms can be calculated from training data.

P(X): for a single feature x and multiple classes it is calculated as:

$$P(X = x) = \sum_{i=1}^{n} p(x \mid \omega_i)\,P(\omega_i)$$

$P(X \mid \omega_i)$: considering all the attributes jointly needs a huge amount of computation. So we assume that the attributes are independent of each other given the class. This makes the classifier a Naive Bayes classifier, i.e.

$$P(X \mid \omega_i) = \prod_{k=1}^{d} P(x_k \mid \omega_i)$$
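The Naive Bayes factorization can be sketched as follows. This is an illustrative example under stated assumptions: the function name `naive_bayes_posteriors` and all the numbers are hypothetical, and the per-feature likelihoods are assumed to have been estimated from training data beforehand.

```python
from math import prod


def naive_bayes_posteriors(priors, feature_likelihoods):
    """Posterior P(omega_i | X) under the feature-independence assumption.

    priors[i]                 = P(omega_i)
    feature_likelihoods[i][k] = P(x_k | omega_i) for the observed value x_k
    """
    # Independence assumption: P(X | omega_i) is the product over features.
    joint = [p * prod(lk) for p, lk in zip(priors, feature_likelihoods)]
    evidence = sum(joint)  # P(X), by the law of total probability
    return [j / evidence for j in joint]


# Hypothetical values: two classes, two features.
post = naive_bayes_posteriors([0.5, 0.5], [[0.8, 0.5], [0.2, 0.5]])
predicted = max(range(len(post)), key=lambda i: post[i])
```

Here the joint terms are 0.5 × 0.8 × 0.5 = 0.2 and 0.5 × 0.2 × 0.5 = 0.05, so the posteriors are 0.8 and 0.2 and the first class is chosen.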


The simple prior-based approach is not sufficient to classify the fish. To improve it, we use features measured on the fish and incorporate the Bayes classification concept.

Let the pattern be represented by a feature vector X comprising a single feature:

x: length (a continuous random variable)

$p(x \mid \omega_i)$ is the class-conditional probability density function (the probability density of x given the state of nature $\omega_i$). Its distribution depends on the state of nature (class) $\omega_i$ and on the feature value.

It is obtained by observing a large number of training pattern samples (statistical).

The pdf curves $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ describe the difference between the populations of the two classes, e.g. sea bass and salmon.

Fig: Hypothetical class-conditional pdfs show the probability density of measuring a particular feature value x given the pattern is in category $\omega_i$. Density functions are normalized, thus the area under each curve is 1.0.


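Suppose we know the priors $P(\omega_i)$ and the densities $p(x \mid \omega_i)$, and we measure the length of a fish as the value x.

The a posteriori probability $P(\omega_i \mid x)$ (the probability of $\omega_i$ given the measured feature value x) is given by Bayes' theorem:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}, \qquad \text{where } p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$

$$P(\omega_1 \mid x) + P(\omega_2 \mid x) = 1$$

Decision rule:

$$c = \begin{cases} \omega_1 & \text{if } P(\omega_1 \mid x) > P(\omega_2 \mid x) \\ \omega_2 & \text{otherwise} \end{cases} \quad\Longleftrightarrow\quad c = \begin{cases} \omega_1 & \text{if } p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2) \\ \omega_2 & \text{otherwise} \end{cases}$$

The p(x) term can be dropped because the evidence is the same for all classes. If the classes are also equiprobable, i.e. $P(\omega_1) = P(\omega_2) = 0.5$, the priors can be dropped from the comparison as well.

Fig: (a) Class-conditional pdfs show the probability density of measuring a particular feature value x given the pattern is in category $\omega_i$. (b) Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$ and the pdfs of fig (a). Given that a pattern is measured to have feature value x = 14, its probability for class $\omega_2$ is 0.08 and for class $\omega_1$ is 0.92.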
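As a worked check of the posterior values quoted for x = 14, suppose the class-conditional densities at that point were $p(14 \mid \omega_1) = 0.115$ and $p(14 \mid \omega_2) = 0.02$. These two density values are hypothetical (the slides do not state them); they are chosen only so that the arithmetic reproduces the quoted posteriors with priors 2/3 and 1/3.

```latex
\begin{align*}
p(14) &= 0.115 \cdot \tfrac{2}{3} + 0.02 \cdot \tfrac{1}{3} = \tfrac{0.23 + 0.02}{3} = \tfrac{0.25}{3} \\
P(\omega_1 \mid 14) &= \frac{0.115 \cdot \tfrac{2}{3}}{p(14)} = \frac{0.23}{0.25} = 0.92 \\
P(\omega_2 \mid 14) &= \frac{0.02 \cdot \tfrac{1}{3}}{p(14)} = \frac{0.02}{0.25} = 0.08
\end{align*}
```

The two posteriors sum to 1, and since 0.92 > 0.08 the decision rule assigns the pattern to $\omega_1$.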


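Probability of Error

Remember that the goal is to minimize error. Whenever we observe a particular x, the probability of error is:

$P(\text{error} \mid x) = P(\omega_1 \mid x)$ if we decide $\omega_2$ in place of $\omega_1$

$P(\text{error} \mid x) = P(\omega_2 \mid x)$ if we decide $\omega_1$ in place of $\omega_2$

Therefore, the error is minimized by always choosing the largest of $P(\omega_1 \mid x)$ and $P(\omega_2 \mid x)$, which gives

$$P(\text{error} \mid x) = \min[P(\omega_1 \mid x), P(\omega_2 \mid x)]$$

i.e. decide the class $\omega_1$ if

$$P(\omega_1 \mid x) > P(\omega_2 \mid x)$$

or, equivalently, if

$$p(x \mid \omega_1)\,P(\omega_1) > p(x \mid \omega_2)\,P(\omega_2)$$

Decision boundary: the border between classes $\omega_i$ and $\omega_j$, simply where $P(\omega_i \mid x) = P(\omega_j \mid x)$.

Fig: Components of the error with equal priors and a non-optimal decision point x*. The complete pink area (including the triangular area) corresponds to the probability of error for deciding $\omega_1$ when the class is $\omega_2$, and the gray area corresponds to the converse. If we select $x_B$ (as decision boundary) in place of x*, we eliminate the reducible error portion and minimize the error.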


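Generalization:

Allowing more than one feature: replaces the scalar x with a vector x from a d-dimensional feature space $R^d$.

Allowing more than two classes: decide $\omega_i$ if $P(\omega_i \mid X) > P(\omega_j \mid X)$ for all $j \neq i$.

Allowing actions other than classification: e.g. allows refusing to classify when the class is not known.

Introducing a loss function: weighting decision costs.

Common density functions:

1. Gaussian (normal) density function

2. Uniform density function

Bayes classifier behavior is determined by

1) the conditional density $p(x \mid \omega_i)$

2) the a priori probability $P(\omega_i)$

If the class-conditional density is assumed to follow a known distribution function, then estimating $p(x \mid \omega_i)$ for a particular value of x is easy, i.e. there is no need to first find the actual pdf.

The Gaussian density is the most widely used model for the pdf, because it is:

A good approximation for many quantities observed in nature, as extensive studies have shown.

Well behaved and analytically tractable.

An appropriate model for continuous values randomly distributed around a mean $\mu$.

Suitable for multivariate modeling.

Above all, when the class-conditional densities really are Gaussian, the Bayes classifier built on them is optimal and has a simple closed form.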


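A univariate or single-dimensional (d = 1) Gaussian function is defined as:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

About 95% of the area under the curve lies in the range $|x - \mu| \le 2\sigma$, and the peak has value $p(\mu) = 1 / (\sqrt{2\pi}\,\sigma)$.

Fig: Univariate Gaussian pdfs; panel (b) has mean value $\mu = 1$ and variance $\sigma^2 = 0.2$.

The graphs are symmetric, and they are centered at the respective mean value.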
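The properties listed above (peak value, symmetry, roughly 95% of the mass within 2σ) can be checked numerically. A minimal sketch using only the standard library; the function name `gaussian_pdf` is chosen for this example.

```python
from math import erf, exp, pi, sqrt


def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sqrt(2.0 * pi) * sigma)


# Parameters of panel (b) in the figure: mean 1, variance 0.2.
mu, sigma = 1.0, sqrt(0.2)

peak = gaussian_pdf(mu, mu, sigma)          # equals 1 / (sqrt(2*pi) * sigma)
left = gaussian_pdf(mu - 0.3, mu, sigma)    # symmetric about the mean
right = gaussian_pdf(mu + 0.3, mu, sigma)

# P(|x - mu| <= 2*sigma) for any Gaussian, via the error function.
area_2sigma = erf(2.0 / sqrt(2.0))
```

`area_2sigma` evaluates to about 0.954, which is the "about 95%" figure, and it does not depend on μ or σ.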


A multivariate or d-dimensional Gaussian function is defined as:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.

Example: a two-class problem with Gaussian pdf:

Number of features: d = 1 (x). Number of classes: C = 2

Prior probabilities: $P(\omega_1) = P(\omega_2) = 0.5$ # Same for both classes, for simplicity

Class variances: $\sigma_1 = \sigma_2 = \sigma$ # Same for both classes

Class means: $\mu_1 \neq \mu_2$ # Different for the two classes

Posterior probability:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$

p(x) is not taken into account, because it is the same for all classes and does not affect the decision. Furthermore, if the a priori probabilities are equal, they do not affect the comparison of the two posterior values either.

So, decision rule: $\max[\,p(x \mid \omega_1),\, p(x \mid \omega_2)\,]$ # Also called the maximum likelihood rule

i.e. the search for the maximum now rests on the values of the conditional pdfs evaluated at x.


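Example: a two-class problem with Gaussian pdf (continued):

The line at $x_0$ is a threshold partitioning the feature space into two regions, $R_1$ and $R_2$. It is the decision boundary.

The probability of a decision error, for the case of two equiprobable classes, is given by:

$$P_e = \frac{1}{2}\int_{R_2} p(x \mid \omega_1)\,dx + \frac{1}{2}\int_{R_1} p(x \mid \omega_2)\,dx$$

In general the decision boundary also depends on the prior probabilities, because they are part of the posterior probability.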
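For the special case described in the example (one feature, equal priors, equal variances), the threshold and the error probability have a closed form: the threshold is the midpoint of the means and the error is $\Phi(-|\mu_2 - \mu_1| / 2\sigma)$, where $\Phi$ is the standard normal CDF. The sketch below assumes exactly that special case; the function names and the numeric values are illustrative.

```python
from math import erf, sqrt


def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def bayes_error_equal_priors(mu1, mu2, sigma):
    """Threshold x0 and error probability for two equiprobable 1-D Gaussian
    classes that share the same standard deviation sigma."""
    x0 = 0.5 * (mu1 + mu2)          # midpoint, where the two pdfs are equal
    d = abs(mu2 - mu1)
    # Each class contributes 0.5 * Phi(-d / (2*sigma)); the two halves add up.
    p_error = std_normal_cdf(-d / (2.0 * sigma))
    return x0, p_error


# Hypothetical classes: means 0 and 2, unit variance.
x0, pe = bayes_error_equal_priors(0.0, 2.0, 1.0)
```

With these values the threshold is 1.0 and the error is Φ(−1), roughly 0.159. Moving the means further apart or shrinking σ drives the error toward zero.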


Gaussian pdf:

To improve classification correctness further, we add one more feature to our fish sorting.

The pattern is now represented by a feature vector X comprising two features:

x1: length

x2: lightness intensity

$p(X \mid \omega_i)$ is the class-conditional probability density function for both classes, obtained by observing a large number of training pattern samples (statistical).

By extending Bayes' theorem to the multi-feature, multi-class problem we can design the decision rule for our fish sorting problem.

Fig: The probability density functions in 2D (two features) for classes $\omega_1$ and $\omega_2$. Since we have included two features, the decision boundaries are clearer, which reduces the error.

Gaussian pdf

The general multivariate Gaussian pdf is defined as:

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

The class-dependent Gaussian density functions $p(x \mid \omega_i)$ can be given as:

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right) \qquad (1)$$
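To make the multivariate density concrete, here is a sketch for the two-feature case (d = 2), with the 2×2 inverse and determinant written out so that no external library is needed. The function name `gaussian_pdf_2d` and the test values are assumptions for this example; a general-d implementation would use a linear algebra library instead.

```python
from math import exp, pi, sqrt


def gaussian_pdf_2d(x, mu, cov):
    """Bivariate Gaussian density. cov is a 2x2 covariance matrix given as
    ((s11, s12), (s21, s22)); it must be symmetric positive definite."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) using the 2x2 inverse.
    quad = (s22 * dx0 * dx0 - (s12 + s21) * dx0 * dx1 + s11 * dx1 * dx1) / det
    norm = 2.0 * pi * sqrt(det)   # (2*pi)^(d/2) * |Sigma|^(1/2) with d = 2
    return exp(-0.5 * quad) / norm


identity = ((1.0, 0.0), (0.0, 1.0))
at_mean = gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), identity)
one_away = gaussian_pdf_2d((1.0, 0.0), (0.0, 0.0), identity)
```

For the identity covariance the density at the mean is 1/(2π), and one unit away it is reduced by the factor exp(−1/2).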


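Gaussian pdf

For classification, we pick the class with the largest discriminant function. The discriminant function for the ith class can be given in terms of posterior probability maximization as:

$$g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$

Since p(x) is constant for all classes, we can drop it, so

$$g_i(x) = p(x \mid \omega_i)\,P(\omega_i)$$

Because log() is monotonically increasing, an alternative discriminant function for the ith class is:

$$g_i(x) = \log\big(p(x \mid \omega_i)\,P(\omega_i)\big) \qquad (2)$$

From equations 1 and 2 we find:

$$g_i(x) = \log p(x \mid \omega_i) + \log P(\omega_i) \qquad (3)$$

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log\big((2\pi)^d\,|\Sigma_i|\big) + \log P(\omega_i) \qquad (4)$$

Now we have the following cases:

Case 1: $\Sigma_i = \Sigma = \sigma^2 I$ (I stands for the identity matrix: diagonal case), i.e. the covariance matrix is equal for all classes and proportional to I.

Case 2: $\Sigma_i = \Sigma$, an arbitrary covariance matrix but identical for all classes.

Case 3: $\Sigma_i$ arbitrary and not identical for all classes.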


Case 1: $\Sigma_i = \Sigma = \sigma^2 I$ (I stands for the identity matrix: diagonal case), i.e. the covariance matrix is equal for all classes and proportional to I.

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log\big((2\pi)^d\,|\Sigma_i|\big) + \log P(\omega_i) \qquad (4)$$

Assuming equal covariance matrices (i.e. the class dependence is only through the mean vectors $\mu_i$), we can drop the 2nd term as a class-independent constant bias, so

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log P(\omega_i) \qquad (5)$$

The first term is the squared distance of the feature vector x from the ith mean vector $\mu_i$, weighted by the inverse of the covariance matrix $\Sigma^{-1}$.

Assume the features are statistically independent and each feature has the same variance $\sigma^2$. Substituting

$$\Sigma^{-1} = \frac{1}{\sigma^2} I, \qquad (x - \mu_i)^T (x - \mu_i) = \|x - \mu_i\|^2$$

into eq. 5 gives eq. 6, the simple discriminant function:

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \log P(\omega_i) \qquad (6)$$

If $P(\omega_i)$ is the same for all c classes, then the $\log P(\omega_i)$ term becomes another unimportant additive constant that can be ignored:

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2}$$

The distance is then the only discriminating factor, i.e. $g_i(x)$ is largest when the distance is smallest.

The optimal decision rule:

To classify a feature vector x, measure the Euclidean distance $\|x - \mu_i\|$ between x and each of the mean vectors, and assign x to the category of the nearest mean.

Such a classifier is called a minimum-distance classifier.
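A minimum-distance classifier takes only a few lines. The sketch below assumes equal priors and a shared covariance σ²I, as in Case 1; the function name `nearest_mean_class` and the class means are hypothetical.

```python
def nearest_mean_class(x, means):
    """Return the index of the class whose mean vector is closest to x.

    Comparing squared Euclidean distances is enough, because the square root
    does not change which distance is smallest.
    """
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    return min(range(len(means)), key=lambda i: sq_dist(x, means[i]))


# Hypothetical class means in a 2-D feature space (length, lightness).
means = [(0.0, 0.0), (4.0, 4.0)]

label_a = nearest_mean_class((1.0, 2.5), means)   # closer to the first mean
label_b = nearest_mean_class((3.0, 3.0), means)   # closer to the second mean
```

Note that σ² never appears: with equal priors it scales every discriminant by the same factor, so it cannot change the winner.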


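Proof

With equal priors, the Bayesian discriminant function is given as:

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2}$$

Since

$$\|x - \mu_i\|^2 = x^T x - 2\mu_i^T x + \mu_i^T \mu_i$$

then

$$g_i(x) = -\frac{x^T x - 2\mu_i^T x + \mu_i^T \mu_i}{2\sigma^2}$$

The quadratic term $x^T x$ is the same for all i, making it an ignorable additive constant. Thus we obtain the equivalent linear discriminant function

$$g_i(x) = w_i^T x + w_{i0}$$

where $w_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^T \mu_i$ and $w_i = \frac{1}{\sigma^2}\,\mu_i$.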


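The decision surfaces for a linear classifier are pieces of hyperplanes defined by the linear equations $g_i(x) = g_j(x)$ for the two categories with the highest posterior probabilities.

If $x_0$ is the threshold, then $g_i(x_0) - g_j(x_0) = 0$.

Since

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \log P(\omega_i)$$

the boundary can be written as

$$w^T (x - x_0) = 0, \qquad w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\,\log\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$

If $P(\omega_i) = P(\omega_j)$, then the point $x_0$ is halfway between the means, and the hyperplane is the perpendicular bisector of the line between the means.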

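If the covariance matrices of two distributions are equal and proportional to the identity matrix, then the distributions are hyperspherical in d dimensions, and the boundary is a generalized hyperplane of d - 1 dimensions, perpendicular to the line separating the means.

If $P(\omega_i) \neq P(\omega_j)$, then the point $x_0$ shifts away from the more likely mean.

If the variance $\sigma^2$ is small relative to the squared distance $\|\mu_i - \mu_j\|^2$, then the position of the decision boundary $x_0$ is relatively insensitive to the exact values of the prior probabilities $P(\omega_i)$.

[Figures: Case 1 decision boundaries in the 1-D, 2-D and 3-D cases.]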

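Case 2: $\Sigma_i = \Sigma$, an arbitrary covariance matrix but identical for each class.

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log\big((2\pi)^d\,|\Sigma_i|\big) + \log P(\omega_i) \qquad (4)$$

Because the covariance matrix is identical for all classes, we can drop the 2nd term and get eq. 5 again:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log P(\omega_i) \qquad (5)$$

If the prior probabilities $P(\omega_i)$ are the same for all classes, then the $\log P(\omega_i)$ term can be ignored.

In this case the covariance matrix affects the distance measurement, so this is not the simple Euclidean distance. It is called the Mahalanobis distance from the feature vector x to $\mu_i$, and its square is given as:

$$r^2 = (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$$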
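The difference between Euclidean and Mahalanobis distance is easiest to see on a small example. The sketch below handles the 2-D case only, with the 2×2 inverse written out by hand; the function name `mahalanobis_sq_2d` and the covariance values are assumptions for illustration.

```python
def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) in 2-D.

    cov is ((s11, s12), (s21, s22)), symmetric positive definite.
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Inverse of a 2x2 matrix is (1/det) * [[s22, -s12], [-s21, s11]].
    return (s22 * dx0 * dx0 - (s12 + s21) * dx0 * dx1 + s11 * dx1 * dx1) / det


# Hypothetical covariance: the first feature has variance 4, the second 1.
cov = ((4.0, 0.0), (0.0, 1.0))
mu = (0.0, 0.0)

r2_along_x1 = mahalanobis_sq_2d((2.0, 0.0), mu, cov)
r2_along_x2 = mahalanobis_sq_2d((0.0, 2.0), mu, cov)
```

Both test points are at Euclidean distance 2 from the mean, but the squared Mahalanobis distances are 1 and 4: a deviation along the high-variance feature counts for less.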

[Figures: Case 2 decision boundaries in the 2-D and 3-D cases, for a covariance matrix that is arbitrary but identical for all classes.]

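Case 3: $\Sigma_i$ is an arbitrary covariance matrix, not identical for each class but invertible.

In the general multivariate normal case, the covariance matrices are different for each category. Dropping only the class-independent constant $\frac{d}{2}\log 2\pi$, the discriminant function is:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

Because the covariance matrix in the first term is arbitrary and class specific, the resulting discriminant functions are inherently quadratic. By expanding the first term we get:

$$g_i(x) = x^T W_i\, x + w_i^T x + w_{i0}$$

where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$.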
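The quadratic behavior is already visible with a single feature, where the covariance matrix reduces to a variance. The following sketch uses that 1-D simplification; the function names and the parameter values are hypothetical.

```python
from math import log


def quadratic_discriminant_1d(x, mu, var, prior):
    """1-D form of the Case 3 discriminant:
    g(x) = -(x - mu)^2 / (2*var) - 0.5*log(var) + log(prior)."""
    return -0.5 * (x - mu) ** 2 / var - 0.5 * log(var) + log(prior)


def classify(x, params):
    """params is a list of (mean, variance, prior); returns the best index."""
    scores = [quadratic_discriminant_1d(x, *p) for p in params]
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical classes with the same mean but unequal variances.
params = [(0.0, 1.0, 0.5), (0.0, 9.0, 0.5)]
labels = [classify(x, params) for x in (-5.0, 0.0, 5.0)]
```

The narrow class wins near the shared mean and the wide class wins on both tails, so the region assigned to the second class is split in two. A linear discriminant cannot produce such a region.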


Decision boundaries

In the two-category case, the decision surfaces are hyperquadrics, which can take any of the following general forms:

1) Hyperplanes,

2) Pairs of hyperplanes,

3) Hyperspheres,

4) Hyperellipsoids,

5) Hyperparaboloids, and

6) Hyperhyperboloids

1-D Case

Fig: Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance.

2-D Case

Fig: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.

3-D Case

Fig: Arbitrary three-dimensional Gaussian distributions yield Bayes decision boundaries that are two-dimensional hyperquadrics.

Fig: The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex.

Bayes risk: a loss function more general than probability of error.

In a classification problem we face the following situations:

In some cases, all classification mistakes are equally costly.

In other cases, classification mistakes are not equally costly, i.e. some mistakes are costlier than others, e.g. when discriminating between a signal (e.g. a message or perhaps a radar return) and noise. Here, our aim is to develop strategies so that the incoming signal is correctly classified.


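Case 1: The incoming signal is a valid signal and we correctly classify it as such (a hit): $\omega_1 \to \omega_1$

Case 2: The incoming signal is a valid signal and we incorrectly classify it as noise (a miss): $\omega_1 \to \omega_2$

Case 3: The incoming signal is noise and we incorrectly classify it as a valid signal (a false alarm): $\omega_2 \to \omega_1$

Case 4: The incoming signal is noise and we correctly classify it as noise (a true reject): $\omega_2 \to \omega_2$

Suppose that after classification we need to take some action, such as firing a missile or not.

In this case, we can assume that the action associated with a wrong classification is costlier than the one associated with a right classification.

Let $\alpha$ denote the action or decision taken as per the above four cases.

e.g. $\alpha_1$ = fire a missile if signal ($\omega_1$)

$\alpha_2$ = do not fire a missile if noise ($\omega_2$)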

manner.


Let

X be a d-dimensional feature vector and $\{\omega_1, \omega_2, \ldots, \omega_n\}$ the finite set of n classes.

$\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is the set of possible actions, one per class, where $\alpha_i$ is the decision to choose class $\omega_i$.

A decision rule: a function $\alpha(x)$, i.e. a mapping $x \to \alpha_i$.

$\lambda(\alpha_i \mid \omega_j)$ is a loss or cost function: the loss or cost resulting from taking action $\alpha_i$ when the class is $\omega_j$.

$\lambda_{ij}$ is shorthand for the cost of selecting action $\alpha_i$ when the class is $\omega_j$.

E.g. in the case c = 2 (number of classes):

$\lambda_{11}$ and $\lambda_{22}$ are the rewards for correct classification and

$\lambda_{12}$ and $\lambda_{21}$ are the penalties for incorrect classification.

Risk R: in decision-theoretic terminology, the expected loss is called the risk. R denotes the overall risk.

$R(\alpha_i \mid x)$: the conditional risk, which depends on the action selected.

For an n-class problem, let X be classified as belonging to class $\omega_k$. Then the probability of error as per the Bayes rule is:

$$P(\text{error} \mid X) = \sum_{i=1,\; i \neq k}^{n} P(\omega_i \mid X)$$

But this Bayes formulation of the error does not describe the risk. So, for risk:

The probability that we select action $\alpha_i$ when $\omega_j$ is the true class is given as:

$$P(\alpha_i, \omega_j) = P(\alpha_i \mid \omega_j)\,P(\omega_j)$$

The term $P(\alpha_i \mid \omega_j)$ depends on the chosen mapping $\alpha(x) \to \alpha_i$, which in turn depends on x. The conditional risk is then given as:

$$R(\alpha_i \mid x) = \sum_{j=1}^{n} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid x) = \sum_{j=1}^{n} \lambda_{ij}\,P(\omega_j \mid x)$$
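The conditional risk is a weighted sum of posteriors, one sum per candidate action. A minimal sketch follows; the function names, the loss values and the posteriors are hypothetical, with `loss[i][j]` standing for the cost of taking action i+1 when the true class is j+1.

```python
def conditional_risks(loss, posteriors):
    """R(alpha_i | x) = sum_j loss[i][j] * P(omega_j | x), for every action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors))
            for row in loss]


def min_risk_action(loss, posteriors):
    """Return (index of the action with the smallest conditional risk, risks)."""
    risks = conditional_risks(loss, posteriors)
    return min(range(len(risks)), key=lambda i: risks[i]), risks


# Hypothetical costs for the signal/noise example: a miss (deciding noise
# when a signal is present) costs 5, a false alarm costs 1, correct is free.
loss = [[0.0, 1.0],
        [5.0, 0.0]]

# Hypothetical posteriors: noise is the more probable class.
action, risks = min_risk_action(loss, [0.3, 0.7])
```

The risks are 0.7 for the "signal" action and 1.5 for the "noise" action, so the minimum-risk decision is "signal" even though noise has the higher posterior. The asymmetric costs, not the posteriors alone, drive the choice.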


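If c = 2 and the number of actions is 2, then the overall risk measure is:

$$R = \lambda_{11}\,P(\alpha_1 \mid \omega_1)\,P(\omega_1) + \lambda_{12}\,P(\alpha_1 \mid \omega_2)\,P(\omega_2) + \lambda_{21}\,P(\alpha_2 \mid \omega_1)\,P(\omega_1) + \lambda_{22}\,P(\alpha_2 \mid \omega_2)\,P(\omega_2)$$

The measure of conditional risk for each action is:

For action $\alpha_1$: $R[\alpha(x) \to \alpha_1] = R(\alpha_1 \mid x) = \lambda_{11}\,P(\omega_1 \mid x) + \lambda_{12}\,P(\omega_2 \mid x)$

For action $\alpha_2$: $R[\alpha(x) \to \alpha_2] = R(\alpha_2 \mid x) = \lambda_{21}\,P(\omega_1 \mid x) + \lambda_{22}\,P(\omega_2 \mid x)$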

incorrectly (using the total probability theorem):

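To minimize the overall risk, we have to minimize the conditional risk for every x. The lower bound reached by this minimization is the Bayes risk.

Two actions $\alpha_1$ and $\alpha_2$ are to be taken for $\omega_1$ and $\omega_2$ respectively.

The decision rule is formulated as: decide the action $\alpha_1$ (class $\omega_1$) if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, i.e. if

$$(\lambda_{21} - \lambda_{11})\,P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2 \mid x) \qquad (1)$$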


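Alternatively, under the assumption $\lambda_{21} > \lambda_{11}$, i.e. the cost of an incorrect decision is higher than that of the correct one, we can write the decision rule as: decide $\omega_1$ if

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \qquad (2)$$

The right-hand side is the ratio of the prior probabilities $P(\omega_2)/P(\omega_1)$ weighted by the loss ratio $(\lambda_{12} - \lambda_{22})/(\lambda_{21} - \lambda_{11})$.

The rule can be interpreted as deciding $\omega_1$ if the likelihood ratio exceeds a threshold value that is independent of the observation x.

Effect of $\lambda_{ij}$ in the case of n = 2 classes:

$\lambda_{11} = -2$, $\lambda_{22} = -1$: for correct classification

$\lambda_{12} = 2$, $\lambda_{21} = 4$: for incorrect classification

Equation 2 becomes:

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{2 - (-1)}{4 - (-2)} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \frac{1}{2} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

The threshold is halved, so the rule favors identifying the class $\omega_1$, i.e. signal rather than noise.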
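The threshold in the likelihood-ratio rule can be computed directly from the four costs and the two priors. A small sketch, with hypothetical function names; the costs are the ones listed in the text, and equal priors are assumed for the example.

```python
def likelihood_ratio_threshold(l11, l12, l21, l22, p1, p2):
    """Threshold on p(x|w1)/p(x|w2) above which class w1 is chosen.

    Assumes l21 > l11, i.e. a wrong decision costs more than a right one.
    """
    return (l12 - l22) / (l21 - l11) * (p2 / p1)


def decide_omega1(lik1, lik2, threshold):
    """True if the likelihood ratio exceeds the threshold."""
    return lik1 / lik2 > threshold


# Costs from the text, with equal priors assumed for illustration.
t_costs = likelihood_ratio_threshold(-2, 2, 4, -1, 0.5, 0.5)

# Zero-one loss for comparison: the threshold is just the prior ratio.
t_zero_one = likelihood_ratio_threshold(0, 1, 1, 0, 0.5, 0.5)

# Hypothetical likelihoods at some x: the ratio is 0.6.
with_costs = decide_omega1(0.3, 0.5, t_costs)
with_zero_one = decide_omega1(0.3, 0.5, t_zero_one)
```

With these costs the threshold drops from 1.0 to 0.5, so an observation with likelihood ratio 0.6 is classified as signal under the cost-weighted rule but as noise under zero-one loss.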


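Often we choose:

$\lambda_{11} = \lambda_{22} = 0$, i.e. no cost or risk for correct classification

$\lambda_{12} = \lambda_{21} = 1$, i.e. unit cost or risk for incorrect classification

So, in this case equation 2 becomes:

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}$$

which is the rule for minimization of the probability of error.

For an n-class problem, to set the cost of correct and incorrect classification we can define the cost/loss function as a zero-one loss function:

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, n$$

The risk corresponding to this loss function is the average probability of error, since the conditional risk for decision $\alpha_i$ is now given as:

$$R(\alpha_i \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

Thus, to minimize the average probability of error, we should select the i that maximizes the posterior probability $P(\omega_i \mid x)$.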

Thanks
