
Last updated: Sept 20, 2012

MULTIVARIATE NORMAL DISTRIBUTION


J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Linear Algebra

Tutorial this Wed 3:00–4:30 in Bethune 228.

Linear Algebra Reviews:
- Kolter, Z., available at http://cs229.stanford.edu/section/cs229-linalg.pdf
- Prince, Appendix C (up to and including C.7.1)
- Bishop, Appendix C
- Roweis, S., available at http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf


Credits

Some of these slides were sourced and/or modified from:
- Christopher Bishop, Microsoft UK
- Simon Prince, University College London
- Sergios Theodoridis, University of Athens & Konstantinos Koutroumbas, National Observatory of Athens


The Multivariate Normal Distribution: Topics



1. The Multivariate Normal Distribution
2. Decision Boundaries in Higher Dimensions
3. Parameter Estimation
   3.1 Maximum Likelihood Parameter Estimation
   3.2 Bayesian Parameter Estimation


Part 1: The Multivariate Normal Distribution

The Multivariate Gaussian



The multivariate Gaussian (normal) density in $D$ dimensions is
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right),$$
with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$.

MATLAB Statistics Toolbox Function: mvnpdf(x, mu, sigma)
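As a rough Python analogue of mvnpdf (a sketch, not part of the original slides; the mean, covariance, and query point below are made up for illustration), the density can be evaluated with scipy.stats.multivariate_normal or directly from the formula above:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D parameters, chosen only for illustration.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
x = np.array([0.5, 1.2])

# Library evaluation (analogous to MATLAB's mvnpdf(x, mu, sigma)).
p_lib = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Direct evaluation of the density formula.
D = len(mu)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
p_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

print(p_lib, p_manual)   # the two values agree
```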


Orthonormal Form

The exponent of the Gaussian is $-\Delta^2/2$, where $\Delta$ is the Mahalanobis distance from $\boldsymbol\mu$ to $\mathbf{x}$:
$$\Delta^2 = (\mathbf{x}-\boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu).$$

MATLAB Statistics Toolbox Function: mahal(x, y)

Let $A \in \mathbb{R}^{D \times D}$. $\lambda$ is an eigenvalue and $\mathbf{u}$ is an eigenvector of $A$ if $A\mathbf{u} = \lambda\mathbf{u}$.

MATLAB Functions: [V, D] = eig(A), [V, D] = eigs(A, k)

Let $\mathbf{u}_i$ and $\lambda_i$ represent the $i$th eigenvector/eigenvalue pair of $\boldsymbol\Sigma$: $\boldsymbol\Sigma\mathbf{u}_i = \lambda_i\mathbf{u}_i$.

See the Linear Algebra Review Resources on the Moodle site for a review of eigenvectors.
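For readers working in Python rather than MATLAB, a minimal NumPy sketch of the same eigendecomposition (the covariance matrix below is made up; numpy.linalg.eigh is used because a covariance matrix is symmetric, and scipy.sparse.linalg.eigsh plays the role of eigs(A, k) for large problems):

```python
import numpy as np

# Hypothetical symmetric positive-definite covariance matrix.
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])

# Analogue of MATLAB's [V, D] = eig(A) for symmetric A:
# eigenvalues lam (ascending) and orthonormal eigenvectors as columns of U.
lam, U = np.linalg.eigh(Sigma)

# Check Sigma u_i = lambda_i u_i for each eigenpair.
for i in range(len(lam)):
    assert np.allclose(Sigma @ U[:, i], lam[i] * U[:, i])

# Check the orthonormal form Sigma = U Lambda U^T.
assert np.allclose(U @ np.diag(lam) @ U.T, Sigma)
print(lam)
```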

Orthonormal Form

Since it is used in a quadratic form, we can assume that $\boldsymbol\Sigma^{-1}$ is symmetric. This means that all of its eigenvalues and eigenvectors are real.

We are also implicitly assuming that $\boldsymbol\Sigma$, and hence $\boldsymbol\Sigma^{-1}$, is invertible (of full rank).

Thus $\boldsymbol\Sigma$ can be represented in orthonormal form: $\boldsymbol\Sigma = U\Lambda U^T$, where the columns of $U$ are the eigenvectors $\mathbf{u}_i$ of $\boldsymbol\Sigma$, and $\Lambda$ is the diagonal matrix with entries $\Lambda_{ii} = \lambda_i$ equal to the corresponding eigenvalues of $\boldsymbol\Sigma$.

Thus the Mahalanobis distance $\Delta^2$ can be represented as
$$\Delta^2 = (\mathbf{x}-\boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu) = (\mathbf{x}-\boldsymbol\mu)^T U \Lambda^{-1} U^T (\mathbf{x}-\boldsymbol\mu).$$
Let $\mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu)$. Then we have
$$\Delta^2 = \mathbf{y}^T \Lambda^{-1} \mathbf{y} = \sum_{ij} y_i \Lambda^{-1}_{ij} y_j = \sum_i \lambda_i^{-1} y_i^2,$$
where $y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu)$.
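A quick NumPy check of this identity (a sketch with made-up numbers, not from the slides): the Mahalanobis distance computed directly from $\boldsymbol\Sigma^{-1}$ matches $\sum_i y_i^2/\lambda_i$ in the rotated coordinates $\mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu)$.

```python
import numpy as np

# Hypothetical mean, covariance, and query point.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
x = np.array([1.0, 2.2])

# Direct quadratic form: Delta^2 = (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
d2_direct = diff @ np.linalg.solve(Sigma, diff)

# Same quantity in the eigenbasis: y = U^T (x - mu), Delta^2 = sum_i y_i^2 / lambda_i.
lam, U = np.linalg.eigh(Sigma)
y = U.T @ diff
d2_eigen = np.sum(y ** 2 / lam)

print(d2_direct, d2_eigen)   # identical up to rounding
```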


Geometry of the Multivariate Gaussian



The constant-density contours of the Gaussian are ellipsoids centred on $\boldsymbol\mu$, with axes along the eigenvectors $\mathbf{u}_i$ of $\boldsymbol\Sigma$ and semi-axis lengths proportional to $\sqrt{\lambda_i}$:
$$\Delta^2 = \sum_i \frac{y_i^2}{\lambda_i}, \qquad y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu), \ \text{ or } \ \mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu),$$
where $\Delta$ is the Mahalanobis distance from $\boldsymbol\mu$ to $\mathbf{x}$ and $(\mathbf{u}_i, \lambda_i)$ are the $i$th eigenvector and eigenvalue of $\boldsymbol\Sigma$.


Moments of the Multivariate Gaussian



First moment: substituting $\mathbf{z} = \mathbf{x} - \boldsymbol\mu$ in $E[\mathbf{x}] = \int \mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu,\boldsymbol\Sigma)\,\mathbf{x}\,d\mathbf{x}$, the term linear in $\mathbf{z}$ integrates to zero thanks to the anti-symmetry of $\mathbf{z}$ (the exponent is an even function of $\mathbf{z}$), leaving $E[\mathbf{x}] = \boldsymbol\mu$.


Moments of the Multivariate Gaussian


Second moment: a similar calculation gives $E[\mathbf{x}\mathbf{x}^T] = \boldsymbol\mu\boldsymbol\mu^T + \boldsymbol\Sigma$, so that $\operatorname{cov}[\mathbf{x}] = E\big[(\mathbf{x}-E[\mathbf{x}])(\mathbf{x}-E[\mathbf{x}])^T\big] = \boldsymbol\Sigma$.
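A Monte Carlo sanity check of both moments (an illustration with arbitrary parameters, not part of the slides): the sample mean and sample covariance of draws from $\mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma)$ approach $\boldsymbol\mu$ and $\boldsymbol\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)

print(X.mean(axis=0))            # approximately mu
print(np.cov(X, rowvar=False))   # approximately Sigma
```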


5.1 Application: Face Detection

Model # 1: Gaussian, uniform covariance



Fit the model using the maximum likelihood criterion.

[Figure: face template $\boldsymbol\mu_{\text{face}}$ and non-face template $\boldsymbol\mu_{\text{non-face}}$, with the fitted spreads $\sigma_{\text{face}} = 59.1$ and $\sigma_{\text{non-face}} = 69.1$, illustrated in a two-pixel (Pixel 1 vs. Pixel 2) feature space.]

Model 1 Results

Results based on 200 cropped faces and 200 non-faces from the same database.
[Figure: Receiver-Operator Characteristic (ROC) curve, Pr(Hit) vs. Pr(False Alarm).]

How does this work with a real image?
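For reference, an ROC like the one summarized above can be traced by sweeping a decision threshold over the classifier's scores. A minimal NumPy sketch with synthetic scores (the scores and labels here are invented; they are not the slide's face data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scores: higher = more "face-like". 200 faces, 200 non-faces.
scores_face = rng.normal(1.0, 1.0, 200)
scores_nonface = rng.normal(-1.0, 1.0, 200)
scores = np.concatenate([scores_face, scores_nonface])
labels = np.concatenate([np.ones(200), np.zeros(200)])

# Sweep thresholds: Pr(Hit) = fraction of faces above threshold,
# Pr(False Alarm) = fraction of non-faces above threshold.
thresholds = np.sort(scores)[::-1]
hits = np.array([(scores[labels == 1] >= t).mean() for t in thresholds])
false_alarms = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])

for fa, h in zip(false_alarms[::80], hits[::80]):
    print(f"Pr(False Alarm)={fa:.2f}  Pr(Hit)={h:.2f}")
```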

Model # 2: Gaussian, diagonal covariance



Fit the model using the maximum likelihood criterion.

[Figure: mean templates $\boldsymbol\mu_{\text{face}}$ and $\boldsymbol\mu_{\text{non-face}}$ and per-pixel standard-deviation maps $\boldsymbol\sigma_{\text{face}}$ and $\boldsymbol\sigma_{\text{non-face}}$, with the model illustrated in a two-pixel (Pixel 1 vs. Pixel 2) feature space.]

Model 2 Results

Results based on 200 cropped faces and 200 non-faces from the same database.
The more sophisticated model unsurprisingly classifies new faces and non-faces better.

[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the diagonal-covariance and uniform-covariance models.]

Model # 3: Gaussian, full covariance



Fit the model using the maximum likelihood criterion.

PROBLEM: we cannot fit this model. We don't have enough data to estimate the full covariance matrix:
- N = 400 training images, D = 10,800 dimensions
- Total number of measured numbers = ND = 400 × 10,800 = 4,320,000
- Total number of parameters in the covariance matrix = D(D+1)/2 = (10,800+1) × 10,800/2 = 58,325,400
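The parameter-count arithmetic, and the way a diagonal covariance sidesteps it, can be checked in a few lines (a sketch: N and D come from the slide, while the stand-in data are random and much lower-dimensional):

```python
import numpy as np

N, D = 400, 10_800
print("measured numbers ND:", N * D)                              # 4,320,000
print("full covariance parameters D(D+1)/2:", D * (D + 1) // 2)   # 58,325,400
print("diagonal covariance parameters:", D)                       # one variance per pixel

# A diagonal-covariance Gaussian needs only per-dimension means and variances,
# which N = 400 samples can support. Stand-in 50-D data for illustration:
rng = np.random.default_rng(0)
X_faces = rng.normal(size=(N, 50))
mu_ml = X_faces.mean(axis=0)    # per-pixel mean
var_ml = X_faces.var(axis=0)    # per-pixel variance = diagonal of Sigma
print(mu_ml.shape, var_ml.shape)
```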

Part 2: Decision Boundaries in Higher Dimensions

Decision Surfaces

If decision regions $R_i$ and $R_j$ are contiguous, define
$$g(\mathbf{x}) \equiv P(\omega_i\,|\,\mathbf{x}) - P(\omega_j\,|\,\mathbf{x}).$$
Then the decision surface
$$g(\mathbf{x}) = 0$$
separates the two decision regions: $g(\mathbf{x})$ is positive on one side and negative on the other.

[Figure: region $R_i$, where $P(\omega_i\,|\,\mathbf{x}) > P(\omega_j\,|\,\mathbf{x})$, and region $R_j$, where $P(\omega_j\,|\,\mathbf{x}) > P(\omega_i\,|\,\mathbf{x})$, separated by the surface $g(\mathbf{x}) = 0$, with $g > 0$ on the $R_i$ side and $g < 0$ on the $R_j$ side.]


Discriminant Functions

If $f(\cdot)$ is monotonic, the decision rule remains the same if we use
$$\mathbf{x} \in \omega_i \ \text{ if } \ f\big(P(\omega_i\,|\,\mathbf{x})\big) > f\big(P(\omega_j\,|\,\mathbf{x})\big) \quad \forall\, j \neq i.$$
$g_i(\mathbf{x}) \equiv f\big(P(\omega_i\,|\,\mathbf{x})\big)$ is a discriminant function.

In general, discriminant functions can be defined in other ways, independent of Bayes. In theory this will lead to a suboptimal solution. However, non-Bayesian classifiers can have significant advantages:
- Often a full Bayesian treatment is intractable or computationally prohibitive.
- Approximations made in a Bayesian treatment may lead to errors avoided by non-Bayesian methods.

Multivariate Normal Likelihoods



Multivariate Gaussian pdf:
$$p(\mathbf{x}\,|\,\omega_i) = \frac{1}{(2\pi)^{D/2}|\boldsymbol\Sigma_i|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i)\right)$$
where
$$\boldsymbol\mu_i = E[\mathbf{x}\,|\,\omega_i], \qquad \boldsymbol\Sigma_i = E\big[(\mathbf{x}-\boldsymbol\mu_i)(\mathbf{x}-\boldsymbol\mu_i)^T\,|\,\omega_i\big].$$


Logarithmic Discriminant Function



$$p(\mathbf{x}\,|\,\omega_i) = \frac{1}{(2\pi)^{D/2}|\boldsymbol\Sigma_i|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i)\right)$$

$\ln(\cdot)$ is monotonic. Define:
$$g_i(\mathbf{x}) = \ln\big(p(\mathbf{x}\,|\,\omega_i)\,P(\omega_i)\big) = \ln p(\mathbf{x}\,|\,\omega_i) + \ln P(\omega_i)$$
$$= -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i,$$
where
$$C_i = -\tfrac{D}{2}\ln 2\pi - \tfrac{1}{2}\ln|\boldsymbol\Sigma_i|.$$
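The log discriminant can be implemented directly. A minimal Python sketch (not the course's MATLAB code; the two-class parameters below are placeholders):

```python
import numpy as np

def log_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) + ln P(w_i) + C_i,
    with C_i = -(D/2) ln(2 pi) - 1/2 ln|Sigma|."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    C = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return -0.5 * quad + np.log(prior) + C

# Hypothetical two-class problem.
mu1, Sigma1, P1 = np.array([0.0, 0.0]), np.eye(2), 0.5
mu2, Sigma2, P2 = np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5

x = np.array([1.0, 2.2])
g1 = log_discriminant(x, mu1, Sigma1, P1)
g2 = log_discriminant(x, mu2, Sigma2, P2)
print("assign x to class", 1 if g1 > g2 else 2)
```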

Quadratic Classifiers

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
Thus the decision surface has a quadratic form. For a 2D input space, the decision curves are quadrics (ellipses, parabolas, hyperbolas or, in degenerate cases, lines).

Example: Isotropic Likelihoods



$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
Suppose that the two likelihoods are both isotropic ($\boldsymbol\Sigma_i = \sigma_i^2 I$), but with different means and variances. Then
$$g_i(\mathbf{x}) = -\frac{1}{2\sigma_i^2}(x_1^2 + x_2^2) + \frac{1}{\sigma_i^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma_i^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i,$$
and $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ will be a quadratic equation in 2 variables.


Equal Covariances

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
The quadratic term of the decision boundary $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ is given by
$$\tfrac{1}{2}\,\mathbf{x}^T\big(\boldsymbol\Sigma_j^{-1} - \boldsymbol\Sigma_i^{-1}\big)\,\mathbf{x}.$$
Thus if the covariance matrices of the two likelihoods are identical, the decision boundary is linear.

Linear Classifier

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
In this case, we can drop the quadratic terms (they are common to all classes) and express the discriminant function in linear form:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol\Sigma^{-1}\boldsymbol\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\boldsymbol\mu_i^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_i.$$
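A small NumPy sketch of the linear form (illustrative parameters only), checking that with a shared covariance it assigns the same class as the quadratic discriminant, since the dropped terms are common to all classes:

```python
import numpy as np

# Hypothetical shared covariance, class means, and priors.
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)
ws = [Sigma_inv @ mu for mu in mus]                                          # w_i = Sigma^{-1} mu_i
w0s = [np.log(P) - 0.5 * mu @ Sigma_inv @ mu for mu, P in zip(mus, priors)]  # w_i0

x = np.array([1.0, 2.2])
g_linear = [w @ x + w0 for w, w0 in zip(ws, w0s)]
g_quad = [-0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(P)
          for mu, P in zip(mus, priors)]

print(int(np.argmax(g_linear)), int(np.argmax(g_quad)))   # same winning class
```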


Example 1: Isotropic, Identical Variance



$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol\Sigma^{-1}\boldsymbol\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\boldsymbol\mu_i^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_i$$

Decision Boundary

Let $\boldsymbol\Sigma = \sigma^2 I$. Then the decision surface has the form
$$\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0, \quad \text{where } \mathbf{w} = \boldsymbol\mu_i - \boldsymbol\mu_j \quad \text{and} \quad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol\mu_i + \boldsymbol\mu_j) - \sigma^2 \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\boldsymbol\mu_i - \boldsymbol\mu_j}{\|\boldsymbol\mu_i - \boldsymbol\mu_j\|^2}.$$



Example 2: Equal Covariance



$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol\Sigma^{-1}\boldsymbol\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\boldsymbol\mu_i^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_i$$

$$g_{ij}(\mathbf{x}) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0,$$
where
$$\mathbf{w} = \boldsymbol\Sigma^{-1}(\boldsymbol\mu_i - \boldsymbol\mu_j), \qquad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol\mu_i + \boldsymbol\mu_j) - \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\boldsymbol\mu_i - \boldsymbol\mu_j}{\|\boldsymbol\mu_i - \boldsymbol\mu_j\|^2_{\boldsymbol\Sigma^{-1}}},$$
and
$$\|\mathbf{x}\|_{\boldsymbol\Sigma^{-1}} \equiv \big(\mathbf{x}^T\boldsymbol\Sigma^{-1}\mathbf{x}\big)^{1/2}.$$


Minimum Distance Classifiers



If the two likelihoods have identical covariance AND the two classes are equiprobable, the discriminant function simplifies:
$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
$$\Longrightarrow \quad g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i).$$


Isotropic Case

In the isotropic case ($\boldsymbol\Sigma = \sigma^2 I$),
$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i) = -\frac{1}{2\sigma^2}\,\|\mathbf{x}-\boldsymbol\mu_i\|^2.$$
Thus the Bayesian classifier simply assigns the class that minimizes the Euclidean distance $d_E$ between the observed feature vector and the class mean:
$$d_E = \|\mathbf{x} - \boldsymbol\mu_i\|.$$


General Case: Mahalanobis Distance



To deal with anisotropic distributions, we simply classify according to the Mahalanobis distance, defined as
$$\Delta_i = \big[(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i)\big]^{1/2}.$$
Let $\mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu)$. Then we have
$$\Delta^2 = \mathbf{y}^T\Lambda^{-1}\mathbf{y} = \sum_{ij} y_i \Lambda^{-1}_{ij} y_j = \sum_i \lambda_i^{-1} y_i^2,$$
where $y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu)$.


General Case: Mahalanobis Distance



Since $\Delta^2 = \sum_i \lambda_i^{-1} y_i^2$ in the rotated coordinates $y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu)$, the curves of constant Mahalanobis distance $c$ have ellipsoidal form.


Example:

Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and
$$p(\mathbf{x}\,|\,\omega_1) = \mathcal{N}(\boldsymbol\mu_1, \boldsymbol\Sigma), \qquad p(\mathbf{x}\,|\,\omega_2) = \mathcal{N}(\boldsymbol\mu_2, \boldsymbol\Sigma),$$
$$\boldsymbol\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \boldsymbol\mu_2 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \qquad \boldsymbol\Sigma = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{bmatrix},$$
classify the vector $\mathbf{x} = \begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix}$ using Bayesian classification.

$$\boldsymbol\Sigma^{-1} = \begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}$$

Compute the Mahalanobis distances $d_m$ from $\boldsymbol\mu_1$ and $\boldsymbol\mu_2$:
$$d_{m,1}^2 = \begin{bmatrix} 1.0 & 2.2 \end{bmatrix}\boldsymbol\Sigma^{-1}\begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix} = 2.952, \qquad d_{m,2}^2 = \begin{bmatrix} -2.0 & -0.8 \end{bmatrix}\boldsymbol\Sigma^{-1}\begin{bmatrix} -2.0 \\ -0.8 \end{bmatrix} = 3.672.$$

Classify $\mathbf{x} \in \omega_1$. Observe that $d_{E,2} < d_{E,1}$: a minimum Euclidean distance classifier would instead have assigned $\mathbf{x}$ to $\omega_2$.
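These numbers are easy to verify (a short sketch reproducing the slide's arithmetic; scipy's mahalanobis returns the distance, i.e. the square root of the quadratic form):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)   # [[0.95, -0.15], [-0.15, 0.55]]

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
x = np.array([1.0, 2.2])

d2_m1 = (x - mu1) @ Sigma_inv @ (x - mu1)            # 2.952
d2_m2 = (x - mu2) @ Sigma_inv @ (x - mu2)            # 3.672
d2_m1_check = mahalanobis(x, mu1, Sigma_inv) ** 2    # same as d2_m1

d_E1, d_E2 = np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)

print(d2_m1, d2_m2, d2_m1_check)   # Mahalanobis: class 1 is closer
print(d_E1, d_E2)                  # Euclidean: class 2 looks closer
```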


Part 3: Parameter Estimation

Maximum Likelihood Parameter Estimation



Suppose we believe input vectors $\mathbf{x}$ are distributed as $p(\mathbf{x}) \equiv p(\mathbf{x};\theta)$, where $\theta$ is an unknown parameter vector. Given independent training input vectors $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, we want to compute the maximum likelihood estimate $\theta_{ML}$ for $\theta$. Since the input vectors are independent, we have
$$p(X;\theta) \equiv p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N;\theta) = \prod_{k=1}^{N} p(\mathbf{x}_k;\theta).$$


Maximum Likelihood Parameter Estimation



$$p(X;\theta) = \prod_{k=1}^{N} p(\mathbf{x}_k;\theta)$$
Let
$$L(\theta) \equiv \ln p(X;\theta) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k;\theta).$$
The general method is to take the derivative of $L$ with respect to $\theta$, set it to 0, and solve for $\theta$:
$$\theta_{ML}: \quad \frac{\partial L(\theta)}{\partial\theta} = \sum_{k=1}^{N} \frac{\partial \ln p(\mathbf{x}_k;\theta)}{\partial\theta} = 0.$$
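The same recipe can be followed numerically when no closed form is convenient. A toy sketch (with made-up univariate data) maximizing the Gaussian log likelihood with scipy.optimize and comparing against the closed-form estimates:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical 1-D training data

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # parameterize sigma by its log to keep it positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])

print(mu_ml, sigma_ml)            # numerical optimum
print(x.mean(), x.std(ddof=0))    # closed-form ML estimates: essentially identical
```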

Properties of the Maximum Likelihood Estimator



Let $\theta_0$ be the true value of the unknown parameter vector. Then:
- $\theta_{ML}$ is asymptotically unbiased: $\displaystyle\lim_{N\to\infty} E[\theta_{ML}] = \theta_0$
- $\theta_{ML}$ is asymptotically consistent: $\displaystyle\lim_{N\to\infty} E\big[\|\theta_{ML} - \theta_0\|^2\big] = 0$


Example: Univariate Normal



Likelihood example: for $p(x;\theta) = \mathcal{N}(x\,|\,\mu,\sigma^2)$, setting $\partial L/\partial\mu = 0$ and $\partial L/\partial\sigma^2 = 0$ gives
$$\mu_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{k=1}^{N}(x_k - \mu_{ML})^2.$$


Example: Univariate Normal


Taking expectations over datasets drawn from the true distribution:
$$E[\mu_{ML}] = \mu, \qquad E\big[\sigma^2_{ML}\big] = \frac{N-1}{N}\,\sigma^2.$$


Example: Univariate Normal



Thus $\sigma^2_{ML}$ is biased (although asymptotically unbiased).
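The bias is easy to see in simulation (a sketch with arbitrary true parameters): averaged over many datasets of size $N$, $\sigma^2_{ML}$ comes out near $\frac{N-1}{N}\sigma^2$ rather than $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true, N, trials = 4.0, 10, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
mu_ml = samples.mean(axis=1, keepdims=True)
sigma2_ml = ((samples - mu_ml) ** 2).mean(axis=1)   # ML variance estimate per dataset

print(sigma2_ml.mean())              # close to (N-1)/N * sigma2_true = 3.6
print((N - 1) / N * sigma2_true)     # 3.6
```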



Example: Multivariate Normal



Given i.i.d. data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the log likelihood function is given by
$$\ln p(X\,|\,\boldsymbol\mu,\boldsymbol\Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol\mu)^T\boldsymbol\Sigma^{-1}(\mathbf{x}_n-\boldsymbol\mu).$$


Maximum Likelihood for the Gaussian



Set the derivative of the log likelihood function with respect to $\boldsymbol\mu$ to zero,
$$\frac{\partial}{\partial\boldsymbol\mu}\ln p(X\,|\,\boldsymbol\mu,\boldsymbol\Sigma) = \sum_{n=1}^{N}\boldsymbol\Sigma^{-1}(\mathbf{x}_n - \boldsymbol\mu) = 0,$$
and solve to obtain
$$\boldsymbol\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n.$$
One can also show that
$$\boldsymbol\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol\mu_{ML})(\mathbf{x}_n - \boldsymbol\mu_{ML})^T.$$
Recall: if $\mathbf{x}$ and $\mathbf{a}$ are vectors, then $\dfrac{\partial}{\partial\mathbf{x}}\big(\mathbf{x}^T\mathbf{a}\big) = \dfrac{\partial}{\partial\mathbf{x}}\big(\mathbf{a}^T\mathbf{x}\big) = \mathbf{a}$.
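A short NumPy sketch of these two estimators (illustrative data only); note that np.cov(..., bias=True) uses the ML normalization $1/N$ rather than the usual $1/(N-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[2.0, 0.4],
                       [0.4, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # rows are x_n

mu_ml = X.mean(axis=0)                            # (1/N) sum_n x_n
diff = X - mu_ml
Sigma_ml = (diff.T @ diff) / len(X)               # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)
print(Sigma_ml)
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True)))   # True
```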


Part 3.2: Bayesian Parameter Estimation

Bayesian Inference for the Gaussian (Univariate Case)



Assume $\sigma^2$ is known. Given i.i.d. data $X = \{x_1, \ldots, x_N\}$, the likelihood function for $\mu$ is given by
$$p(X\,|\,\mu) = \prod_{n=1}^{N} p(x_n\,|\,\mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right).$$
This has a Gaussian shape as a function of $\mu$ (but it is not a distribution over $\mu$).


Bayesian Inference for the Gaussian (Univariate Case)



Combined with a Gaussian prior over $\mu$,
$$p(\mu) = \mathcal{N}(\mu\,|\,\mu_0, \sigma_0^2),$$
this gives the posterior
$$p(\mu\,|\,X) \propto p(X\,|\,\mu)\,p(\mu).$$
Completing the square over $\mu$, we see that
$$p(\mu\,|\,X) = \mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2).$$


Bayesian Inference for the Gaussian



where
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}.$$

Shortcut: $p(\mu\,|\,X)$ has the form $C\exp\!\big(-\tfrac{1}{2}\Delta^2\big)$. Get $\Delta^2$ into the form $a\mu^2 - 2b\mu + c = a(\mu - b/a)^2 + \text{const}$ and identify
$$\mu_N = b/a, \qquad \frac{1}{\sigma_N^2} = a.$$

Note: the posterior precision $1/\sigma_N^2 = 1/\sigma_0^2 + N/\sigma^2$ is the prior precision plus one data precision $1/\sigma^2$ per observation, so as $N \to \infty$, $\mu_N \to \mu_{ML}$ and $\sigma_N^2 \to 0$.
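A compact sketch of this update (illustrative prior and data, not the slide's), checking the closed-form $\mu_N$ and $\sigma_N^2$ against a brute-force normalization of likelihood times prior on a grid:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma0 = 0.0, 0.8              # prior N(mu0, sigma0^2)
sigma = 0.3                         # known observation noise std
x = rng.normal(1.0, sigma, size=20)
N, mu_ml = len(x), x.mean()

# Closed-form posterior parameters.
sigma_N2 = 1.0 / (1.0 / sigma0 ** 2 + N / sigma ** 2)
mu_N = sigma_N2 * (mu0 / sigma0 ** 2 + N * mu_ml / sigma ** 2)

# Brute-force check: unnormalized posterior on a grid.
grid = np.linspace(-2.0, 3.0, 20001)
log_post = norm.logpdf(grid, mu0, sigma0) + \
           norm.logpdf(x[:, None], grid, sigma).sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)

print(mu_N, np.trapz(grid * post, grid))                     # posterior means agree
print(sigma_N2, np.trapz((grid - mu_N) ** 2 * post, grid))   # posterior variances agree
```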

Bayesian Inference for the Gaussian



Example: prior and noise parameters $\mu_0 = 0$, $\sigma_0 = 0.8$, $\sigma^2 = 0.1$; the posterior $p(\mu\,|\,X)$ sharpens around the sample mean as $N$ grows.


Maximum a Posteriori (MAP) Estimation



In MAP estimation, we use the value of $\mu$ that maximizes the posterior $p(\mu\,|\,X)$:
$$\mu_{MAP} = \mu_N.$$


Full Bayesian Parameter Estimation



In both ML and MAP estimation, we use the training data $X$ to estimate a specific value for the unknown parameter vector $\theta$, and then use that value for subsequent inference on new observations $\mathbf{x}$: $p(\mathbf{x}\,|\,\theta)$.

These methods are suboptimal, because in fact we are always uncertain about the exact value of $\theta$, and to be optimal we should take into account the possibility that $\theta$ assumes other values.


Full Bayesian Parameter Estimation



In full Bayesian parameter estimation, we do not estimate a specific value for $\theta$. Instead, we compute the posterior over $\theta$, and then integrate it out when computing $p(\mathbf{x}\,|\,X)$:
$$p(\mathbf{x}\,|\,X) = \int p(\mathbf{x}\,|\,\theta)\,p(\theta\,|\,X)\,d\theta$$
$$p(\theta\,|\,X) = \frac{p(X\,|\,\theta)\,p(\theta)}{p(X)} = \frac{p(X\,|\,\theta)\,p(\theta)}{\int p(X\,|\,\theta)\,p(\theta)\,d\theta}$$
$$p(X\,|\,\theta) = \prod_{k=1}^{N} p(\mathbf{x}_k\,|\,\theta)$$


Example: Univariate Normal with Unknown Mean



Consider again the case $p(x) \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is known and $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.

We showed that $p(\mu\,|\,X) \sim \mathcal{N}(\mu_N, \sigma_N^2)$, where $\mu_N$ and $\sigma_N^2$ are given above.

In the MAP approach, we approximate
$$p(x\,|\,X) \approx \mathcal{N}(\mu_N, \sigma^2).$$
In the full Bayesian approach, we calculate
$$p(x\,|\,X) = \int p(x\,|\,\mu)\,p(\mu\,|\,X)\,d\mu,$$
which can be shown to yield
$$p(x\,|\,X) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2).$$


Comparison: MAP vs Full Bayesian Estimation



MAP: $p(x\,|\,X) \approx \mathcal{N}(\mu_N, \sigma^2)$

Full Bayesian: $p(x\,|\,X) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2)$

The higher (and more realistic) uncertainty in the full Bayesian approach reflects our posterior uncertainty about the exact value of the mean $\mu$.
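A tiny numeric illustration of the difference, using the example values $\mu_0 = 0$, $\sigma_0 = 0.8$, $\sigma^2 = 0.1$ from the earlier slide (a sketch, not from the original deck): with few observations the extra $\sigma_N^2$ term is noticeable, and it shrinks toward zero as $N$ grows.

```python
# MAP vs full Bayesian predictive variance as the dataset grows.
sigma2, sigma0_2 = 0.1, 0.8 ** 2   # observation variance and prior variance

for N in [1, 5, 50, 500]:
    sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # posterior variance of the mean
    print(f"N={N:4d}  MAP predictive var = {sigma2:.4f}   "
          f"full Bayesian predictive var = {sigma2 + sigma_N2:.4f}")
```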

