
Last updated: Sept 20, 2012

MULTIVARIATE NORMAL DISTRIBUTION


J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Linear Algebra

Tutorial this Wed 3:00–4:30 in Bethune 228.

Linear Algebra Reviews:
- Kolter, Z., available at http://cs229.stanford.edu/section/cs229-linalg.pdf
- Prince, Appendix C (up to and including C.7.1)
- Bishop, Appendix C
- Roweis, S., available at http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf


Credits

Some of these slides were sourced and/or modified from:
- Christopher Bishop, Microsoft UK
- Simon Prince, University College London
- Sergios Theodoridis, University of Athens & Konstantinos Koutroumbas, National Observatory of Athens


The Multivariate Normal Distribution: Topics



1. The Multivariate Normal Distribution
2. Decision Boundaries in Higher Dimensions
3. Parameter Estimation
   3.1 Maximum Likelihood Parameter Estimation
   3.2 Bayesian Parameter Estimation


Part 1: The Multivariate Normal Distribution

The Multivariate Gaussian



The multivariate Gaussian (normal) density in $D$ dimensions is
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right),$$
with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$.

MATLAB Statistics Toolbox Function: mvnpdf(x, mu, sigma)
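As a rough Python analogue of mvnpdf (a sketch, not part of the original slides; the mean, covariance, and query point below are made up for illustration), the density can be evaluated with scipy.stats.multivariate_normal or directly from the formula above:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D parameters, chosen only for illustration.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])
x = np.array([0.5, 1.2])

# Library evaluation (analogous to MATLAB's mvnpdf(x, mu, sigma)).
p_lib = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Direct evaluation of the density formula.
D = len(mu)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
p_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

print(p_lib, p_manual)   # the two values agree
```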


Orthonormal Form

The exponent of the Gaussian is $-\Delta^2/2$, where $\Delta$ is the Mahalanobis distance from $\boldsymbol\mu$ to $\mathbf{x}$:
$$\Delta^2 = (\mathbf{x}-\boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu).$$

MATLAB Statistics Toolbox Function: mahal(x, y)

Let $A \in \mathbb{R}^{D \times D}$. $\lambda$ is an eigenvalue and $\mathbf{u}$ is an eigenvector of $A$ if $A\mathbf{u} = \lambda\mathbf{u}$.

MATLAB Functions: [V, D] = eig(A), [V, D] = eigs(A, k)

Let $\mathbf{u}_i$ and $\lambda_i$ represent the $i$th eigenvector/eigenvalue pair of $\boldsymbol\Sigma$: $\boldsymbol\Sigma\mathbf{u}_i = \lambda_i\mathbf{u}_i$.

See the Linear Algebra Review Resources on the Moodle site for a review of eigenvectors.
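For readers working in Python rather than MATLAB, a minimal NumPy sketch of the same eigendecomposition (the covariance matrix below is made up; numpy.linalg.eigh is used because a covariance matrix is symmetric, and scipy.sparse.linalg.eigsh plays the role of eigs(A, k) for large problems):

```python
import numpy as np

# Hypothetical symmetric positive-definite covariance matrix.
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])

# Analogue of MATLAB's [V, D] = eig(A) for symmetric A:
# eigenvalues lam (ascending) and orthonormal eigenvectors as columns of U.
lam, U = np.linalg.eigh(Sigma)

# Check Sigma u_i = lambda_i u_i for each eigenpair.
for i in range(len(lam)):
    assert np.allclose(Sigma @ U[:, i], lam[i] * U[:, i])

# Check the orthonormal form Sigma = U Lambda U^T.
assert np.allclose(U @ np.diag(lam) @ U.T, Sigma)
print(lam)
```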

Orthonormal Form

Since it is used in a quadratic form, we can assume that $\boldsymbol\Sigma^{-1}$ is symmetric. This means that all of its eigenvalues and eigenvectors are real.

We are also implicitly assuming that $\boldsymbol\Sigma$, and hence $\boldsymbol\Sigma^{-1}$, is invertible (of full rank).

Thus $\boldsymbol\Sigma$ can be represented in orthonormal form: $\boldsymbol\Sigma = U\Lambda U^T$, where the columns of $U$ are the eigenvectors $\mathbf{u}_i$ of $\boldsymbol\Sigma$, and $\Lambda$ is the diagonal matrix with entries $\Lambda_{ii} = \lambda_i$ equal to the corresponding eigenvalues of $\boldsymbol\Sigma$.

Thus the Mahalanobis distance $\Delta^2$ can be represented as
$$\Delta^2 = (\mathbf{x}-\boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu) = (\mathbf{x}-\boldsymbol\mu)^T U \Lambda^{-1} U^T (\mathbf{x}-\boldsymbol\mu).$$
Let $\mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu)$. Then we have
$$\Delta^2 = \mathbf{y}^T \Lambda^{-1} \mathbf{y} = \sum_{ij} y_i \Lambda^{-1}_{ij} y_j = \sum_i \lambda_i^{-1} y_i^2,$$
where $y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu)$.
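A quick NumPy check of this identity (a sketch with made-up numbers, not from the slides): the Mahalanobis distance computed directly from $\boldsymbol\Sigma^{-1}$ matches $\sum_i y_i^2/\lambda_i$ in the rotated coordinates $\mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu)$.

```python
import numpy as np

# Hypothetical mean, covariance, and query point.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
x = np.array([1.0, 2.2])

# Direct quadratic form: Delta^2 = (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
d2_direct = diff @ np.linalg.solve(Sigma, diff)

# Same quantity in the eigenbasis: y = U^T (x - mu), Delta^2 = sum_i y_i^2 / lambda_i.
lam, U = np.linalg.eigh(Sigma)
y = U.T @ diff
d2_eigen = np.sum(y ** 2 / lam)

print(d2_direct, d2_eigen)   # identical up to rounding
```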


Geometry of the Multivariate Gaussian



The constant-density contours of the Gaussian are ellipsoids centred on $\boldsymbol\mu$, with axes along the eigenvectors $\mathbf{u}_i$ of $\boldsymbol\Sigma$ and semi-axis lengths proportional to $\sqrt{\lambda_i}$:
$$\Delta^2 = \sum_i \frac{y_i^2}{\lambda_i}, \qquad y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu), \ \text{ or } \ \mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu),$$
where $\Delta$ is the Mahalanobis distance from $\boldsymbol\mu$ to $\mathbf{x}$ and $(\mathbf{u}_i, \lambda_i)$ are the $i$th eigenvector and eigenvalue of $\boldsymbol\Sigma$.


Moments of the Multivariate Gaussian



First moment: substituting $\mathbf{z} = \mathbf{x} - \boldsymbol\mu$ in $E[\mathbf{x}] = \int \mathcal{N}(\mathbf{x}\,|\,\boldsymbol\mu,\boldsymbol\Sigma)\,\mathbf{x}\,d\mathbf{x}$, the term linear in $\mathbf{z}$ integrates to zero thanks to the anti-symmetry of $\mathbf{z}$ (the exponent is an even function of $\mathbf{z}$), leaving $E[\mathbf{x}] = \boldsymbol\mu$.


Moments of the Multivariate Gaussian


Second moment: a similar calculation gives $E[\mathbf{x}\mathbf{x}^T] = \boldsymbol\mu\boldsymbol\mu^T + \boldsymbol\Sigma$, so that $\operatorname{cov}[\mathbf{x}] = E\big[(\mathbf{x}-E[\mathbf{x}])(\mathbf{x}-E[\mathbf{x}])^T\big] = \boldsymbol\Sigma$.
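A Monte Carlo sanity check of both moments (an illustration with arbitrary parameters, not part of the slides): the sample mean and sample covariance of draws from $\mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma)$ approach $\boldsymbol\mu$ and $\boldsymbol\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)

print(X.mean(axis=0))            # approximately mu
print(np.cov(X, rowvar=False))   # approximately Sigma
```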


5.1 Application: Face Detection

Model # 1: Gaussian, uniform covariance



Fit the model using the maximum likelihood criterion.

[Figure: face template $\boldsymbol\mu_{\text{face}}$ and non-face template $\boldsymbol\mu_{\text{non-face}}$, with the fitted spreads $\sigma_{\text{face}} = 59.1$ and $\sigma_{\text{non-face}} = 69.1$, illustrated in a two-pixel (Pixel 1 vs. Pixel 2) feature space.]

Model 1 Results

Results based on 200 cropped faces and 200 non-faces from the same database.
[Figure: Receiver-Operator Characteristic (ROC) curve, Pr(Hit) vs. Pr(False Alarm).]

How does this work with a real image?
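For reference, an ROC like the one summarized above can be traced by sweeping a decision threshold over the classifier's scores. A minimal NumPy sketch with synthetic scores (the scores and labels here are invented; they are not the slide's face data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scores: higher = more "face-like". 200 faces, 200 non-faces.
scores_face = rng.normal(1.0, 1.0, 200)
scores_nonface = rng.normal(-1.0, 1.0, 200)
scores = np.concatenate([scores_face, scores_nonface])
labels = np.concatenate([np.ones(200), np.zeros(200)])

# Sweep thresholds: Pr(Hit) = fraction of faces above threshold,
# Pr(False Alarm) = fraction of non-faces above threshold.
thresholds = np.sort(scores)[::-1]
hits = np.array([(scores[labels == 1] >= t).mean() for t in thresholds])
false_alarms = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])

for fa, h in zip(false_alarms[::80], hits[::80]):
    print(f"Pr(False Alarm)={fa:.2f}  Pr(Hit)={h:.2f}")
```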

Model # 2: Gaussian, diagonal covariance



Fit the model using the maximum likelihood criterion.

[Figure: mean templates $\boldsymbol\mu_{\text{face}}$ and $\boldsymbol\mu_{\text{non-face}}$ and per-pixel standard-deviation maps $\boldsymbol\sigma_{\text{face}}$ and $\boldsymbol\sigma_{\text{non-face}}$, with the model illustrated in a two-pixel (Pixel 1 vs. Pixel 2) feature space.]

Model 2 Results

Results based on 200 cropped faces and 200 non-faces from the same database.
The more sophisticated model unsurprisingly classifies new faces and non-faces better.

[Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the diagonal-covariance and uniform-covariance models.]

Model # 3: Gaussian, full covariance



Fit the model using the maximum likelihood criterion.

PROBLEM: we cannot fit this model. We don't have enough data to estimate the full covariance matrix:
- N = 400 training images, D = 10,800 dimensions
- Total number of measured numbers = ND = 400 × 10,800 = 4,320,000
- Total number of parameters in the covariance matrix = D(D+1)/2 = (10,800+1) × 10,800/2 = 58,325,400
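The parameter-count arithmetic, and the way a diagonal covariance sidesteps it, can be checked in a few lines (a sketch: N and D come from the slide, while the stand-in data are random and much lower-dimensional):

```python
import numpy as np

N, D = 400, 10_800
print("measured numbers ND:", N * D)                              # 4,320,000
print("full covariance parameters D(D+1)/2:", D * (D + 1) // 2)   # 58,325,400
print("diagonal covariance parameters:", D)                       # one variance per pixel

# A diagonal-covariance Gaussian needs only per-dimension means and variances,
# which N = 400 samples can support. Stand-in 50-D data for illustration:
rng = np.random.default_rng(0)
X_faces = rng.normal(size=(N, 50))
mu_ml = X_faces.mean(axis=0)    # per-pixel mean
var_ml = X_faces.var(axis=0)    # per-pixel variance = diagonal of Sigma
print(mu_ml.shape, var_ml.shape)
```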

Part 2: Decision Boundaries in Higher Dimensions

Decision Surfaces

If decision regions $R_i$ and $R_j$ are contiguous, define
$$g(\mathbf{x}) \equiv P(\omega_i\,|\,\mathbf{x}) - P(\omega_j\,|\,\mathbf{x}).$$
Then the decision surface
$$g(\mathbf{x}) = 0$$
separates the two decision regions: $g(\mathbf{x})$ is positive on one side and negative on the other.

[Figure: region $R_i$, where $P(\omega_i\,|\,\mathbf{x}) > P(\omega_j\,|\,\mathbf{x})$, and region $R_j$, where $P(\omega_j\,|\,\mathbf{x}) > P(\omega_i\,|\,\mathbf{x})$, separated by the surface $g(\mathbf{x}) = 0$, with $g > 0$ on the $R_i$ side and $g < 0$ on the $R_j$ side.]


Discriminant Functions

If $f(\cdot)$ is monotonic, the decision rule remains the same if we use
$$\mathbf{x} \in \omega_i \ \text{ if } \ f\big(P(\omega_i\,|\,\mathbf{x})\big) > f\big(P(\omega_j\,|\,\mathbf{x})\big) \quad \forall\, j \neq i.$$
$g_i(\mathbf{x}) \equiv f\big(P(\omega_i\,|\,\mathbf{x})\big)$ is a discriminant function.

In general, discriminant functions can be defined in other ways, independent of Bayes. In theory this will lead to a suboptimal solution. However, non-Bayesian classifiers can have significant advantages:
- Often a full Bayesian treatment is intractable or computationally prohibitive.
- Approximations made in a Bayesian treatment may lead to errors avoided by non-Bayesian methods.

Multivariate Normal Likelihoods



Multivariate Gaussian pdf:
$$p(\mathbf{x}\,|\,\omega_i) = \frac{1}{(2\pi)^{D/2}|\boldsymbol\Sigma_i|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i)\right)$$
where
$$\boldsymbol\mu_i = E[\mathbf{x}\,|\,\omega_i], \qquad \boldsymbol\Sigma_i = E\big[(\mathbf{x}-\boldsymbol\mu_i)(\mathbf{x}-\boldsymbol\mu_i)^T\,|\,\omega_i\big].$$


Logarithmic Discriminant Function



$$p(\mathbf{x}\,|\,\omega_i) = \frac{1}{(2\pi)^{D/2}|\boldsymbol\Sigma_i|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i)\right)$$

$\ln(\cdot)$ is monotonic. Define:
$$g_i(\mathbf{x}) = \ln\big(p(\mathbf{x}\,|\,\omega_i)\,P(\omega_i)\big) = \ln p(\mathbf{x}\,|\,\omega_i) + \ln P(\omega_i)$$
$$= -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i,$$
where
$$C_i = -\tfrac{D}{2}\ln 2\pi - \tfrac{1}{2}\ln|\boldsymbol\Sigma_i|.$$
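The log discriminant can be implemented directly. A minimal Python sketch (not the course's MATLAB code; the two-class parameters below are placeholders):

```python
import numpy as np

def log_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) + ln P(w_i) + C_i,
    with C_i = -(D/2) ln(2 pi) - 1/2 ln|Sigma|."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    C = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return -0.5 * quad + np.log(prior) + C

# Hypothetical two-class problem.
mu1, Sigma1, P1 = np.array([0.0, 0.0]), np.eye(2), 0.5
mu2, Sigma2, P2 = np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5

x = np.array([1.0, 2.2])
g1 = log_discriminant(x, mu1, Sigma1, P1)
g2 = log_discriminant(x, mu2, Sigma2, P2)
print("assign x to class", 1 if g1 > g2 else 2)
```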

Quadratic Classifiers

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
Thus the decision surface has a quadratic form. For a 2D input space, the decision curves are quadrics (ellipses, parabolas, hyperbolas or, in degenerate cases, lines).

Example: Isotropic Likelihoods



$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
Suppose that the two likelihoods are both isotropic ($\boldsymbol\Sigma_i = \sigma_i^2 I$), but with different means and variances. Then
$$g_i(\mathbf{x}) = -\frac{1}{2\sigma_i^2}(x_1^2 + x_2^2) + \frac{1}{\sigma_i^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma_i^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i,$$
and $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ will be a quadratic equation in 2 variables.


Equal Covariances

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
The quadratic term of the decision boundary $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ is given by
$$\tfrac{1}{2}\,\mathbf{x}^T\big(\boldsymbol\Sigma_j^{-1} - \boldsymbol\Sigma_i^{-1}\big)\,\mathbf{x}.$$
Thus if the covariance matrices of the two likelihoods are identical, the decision boundary is linear.

Linear Classifier

$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
In this case, we can drop the quadratic terms (they are common to all classes) and express the discriminant function in linear form:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol\Sigma^{-1}\boldsymbol\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\boldsymbol\mu_i^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_i.$$
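A small NumPy sketch of the linear form (illustrative parameters only), checking that with a shared covariance it assigns the same class as the quadratic discriminant, since the dropped terms are common to all classes:

```python
import numpy as np

# Hypothetical shared covariance, class means, and priors.
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)
ws = [Sigma_inv @ mu for mu in mus]                                          # w_i = Sigma^{-1} mu_i
w0s = [np.log(P) - 0.5 * mu @ Sigma_inv @ mu for mu, P in zip(mus, priors)]  # w_i0

x = np.array([1.0, 2.2])
g_linear = [w @ x + w0 for w, w0 in zip(ws, w0s)]
g_quad = [-0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(P)
          for mu, P in zip(mus, priors)]

print(int(np.argmax(g_linear)), int(np.argmax(g_quad)))   # same winning class
```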


Example 1: Isotropic, Identical Variance



$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol\Sigma^{-1}\boldsymbol\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\boldsymbol\mu_i^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_i$$

Decision Boundary

Let $\boldsymbol\Sigma = \sigma^2 I$. Then the decision surface has the form
$$\mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0, \quad \text{where } \mathbf{w} = \boldsymbol\mu_i - \boldsymbol\mu_j \quad \text{and} \quad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol\mu_i + \boldsymbol\mu_j) - \sigma^2 \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\boldsymbol\mu_i - \boldsymbol\mu_j}{\|\boldsymbol\mu_i - \boldsymbol\mu_j\|^2}.$$



Example 2: Equal Covariance



$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \boldsymbol\Sigma^{-1}\boldsymbol\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \tfrac{1}{2}\boldsymbol\mu_i^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_i$$

$$g_{ij}(\mathbf{x}) = \mathbf{w}^T(\mathbf{x} - \mathbf{x}_0) = 0,$$
where
$$\mathbf{w} = \boldsymbol\Sigma^{-1}(\boldsymbol\mu_i - \boldsymbol\mu_j), \qquad \mathbf{x}_0 = \tfrac{1}{2}(\boldsymbol\mu_i + \boldsymbol\mu_j) - \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\boldsymbol\mu_i - \boldsymbol\mu_j}{\|\boldsymbol\mu_i - \boldsymbol\mu_j\|^2_{\boldsymbol\Sigma^{-1}}},$$
and
$$\|\mathbf{x}\|_{\boldsymbol\Sigma^{-1}} \equiv \big(\mathbf{x}^T\boldsymbol\Sigma^{-1}\mathbf{x}\big)^{1/2}.$$


Minimum Distance Classifiers



If the two likelihoods have identical covariance AND the two classes are equiprobable, the discriminant function simplifies:
$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i) + \ln P(\omega_i) + C_i$$
$$\Longrightarrow \quad g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i).$$


Isotropic Case

In the isotropic case ($\boldsymbol\Sigma = \sigma^2 I$),
$$g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i) = -\frac{1}{2\sigma^2}\,\|\mathbf{x}-\boldsymbol\mu_i\|^2.$$
Thus the Bayesian classifier simply assigns the class that minimizes the Euclidean distance $d_E$ between the observed feature vector and the class mean:
$$d_E = \|\mathbf{x} - \boldsymbol\mu_i\|.$$


General Case: Mahalanobis Distance



To deal with anisotropic distributions, we simply classify according to the Mahalanobis distance, defined as
$$\Delta_i = \big[(\mathbf{x}-\boldsymbol\mu_i)^T \boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_i)\big]^{1/2}.$$
Let $\mathbf{y} = U^T(\mathbf{x}-\boldsymbol\mu)$. Then we have
$$\Delta^2 = \mathbf{y}^T\Lambda^{-1}\mathbf{y} = \sum_{ij} y_i \Lambda^{-1}_{ij} y_j = \sum_i \lambda_i^{-1} y_i^2,$$
where $y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu)$.


General Case: Mahalanobis Distance



Since $\Delta^2 = \sum_i \lambda_i^{-1} y_i^2$ in the rotated coordinates $y_i = \mathbf{u}_i^T(\mathbf{x}-\boldsymbol\mu)$, the curves of constant Mahalanobis distance $c$ have ellipsoidal form.


Example:

Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and
$$p(\mathbf{x}\,|\,\omega_1) = \mathcal{N}(\boldsymbol\mu_1, \boldsymbol\Sigma), \qquad p(\mathbf{x}\,|\,\omega_2) = \mathcal{N}(\boldsymbol\mu_2, \boldsymbol\Sigma),$$
$$\boldsymbol\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \boldsymbol\mu_2 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \qquad \boldsymbol\Sigma = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{bmatrix},$$
classify the vector $\mathbf{x} = \begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix}$ using Bayesian classification.

$$\boldsymbol\Sigma^{-1} = \begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}$$

Compute the Mahalanobis distances $d_m$ from $\boldsymbol\mu_1$ and $\boldsymbol\mu_2$:
$$d_{m,1}^2 = \begin{bmatrix} 1.0 & 2.2 \end{bmatrix}\boldsymbol\Sigma^{-1}\begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix} = 2.952, \qquad d_{m,2}^2 = \begin{bmatrix} -2.0 & -0.8 \end{bmatrix}\boldsymbol\Sigma^{-1}\begin{bmatrix} -2.0 \\ -0.8 \end{bmatrix} = 3.672.$$

Classify $\mathbf{x} \in \omega_1$. Observe that $d_{E,2} < d_{E,1}$: a minimum Euclidean distance classifier would instead have assigned $\mathbf{x}$ to $\omega_2$.
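These numbers are easy to verify (a short sketch reproducing the slide's arithmetic; scipy's mahalanobis returns the distance, i.e. the square root of the quadratic form):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)   # [[0.95, -0.15], [-0.15, 0.55]]

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
x = np.array([1.0, 2.2])

d2_m1 = (x - mu1) @ Sigma_inv @ (x - mu1)            # 2.952
d2_m2 = (x - mu2) @ Sigma_inv @ (x - mu2)            # 3.672
d2_m1_check = mahalanobis(x, mu1, Sigma_inv) ** 2    # same as d2_m1

d_E1, d_E2 = np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)

print(d2_m1, d2_m2, d2_m1_check)   # Mahalanobis: class 1 is closer
print(d_E1, d_E2)                  # Euclidean: class 2 looks closer
```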


Part 3: Parameter Estimation

Maximum Likelihood Parameter Estimation



Suppose we believe input vectors $\mathbf{x}$ are distributed as $p(\mathbf{x}) \equiv p(\mathbf{x};\theta)$, where $\theta$ is an unknown parameter vector. Given independent training input vectors $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, we want to compute the maximum likelihood estimate $\theta_{ML}$ for $\theta$. Since the input vectors are independent, we have
$$p(X;\theta) \equiv p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N;\theta) = \prod_{k=1}^{N} p(\mathbf{x}_k;\theta).$$


Maximum Likelihood Parameter Estimation



$$p(X;\theta) = \prod_{k=1}^{N} p(\mathbf{x}_k;\theta)$$
Let
$$L(\theta) \equiv \ln p(X;\theta) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k;\theta).$$
The general method is to take the derivative of $L$ with respect to $\theta$, set it to 0, and solve for $\theta$:
$$\theta_{ML}: \quad \frac{\partial L(\theta)}{\partial\theta} = \sum_{k=1}^{N} \frac{\partial \ln p(\mathbf{x}_k;\theta)}{\partial\theta} = 0.$$
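The same recipe can be followed numerically when no closed form is convenient. A toy sketch (with made-up univariate data) maximizing the Gaussian log likelihood with scipy.optimize and comparing against the closed-form estimates:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical 1-D training data

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # parameterize sigma by its log to keep it positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])

print(mu_ml, sigma_ml)            # numerical optimum
print(x.mean(), x.std(ddof=0))    # closed-form ML estimates: essentially identical
```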

Properties of the Maximum Likelihood Estimator



Let $\theta_0$ be the true value of the unknown parameter vector. Then:
- $\theta_{ML}$ is asymptotically unbiased: $\displaystyle\lim_{N\to\infty} E[\theta_{ML}] = \theta_0$
- $\theta_{ML}$ is asymptotically consistent: $\displaystyle\lim_{N\to\infty} E\big[\|\theta_{ML} - \theta_0\|^2\big] = 0$


Example: Univariate Normal



Likelihood example: for $p(x;\theta) = \mathcal{N}(x\,|\,\mu,\sigma^2)$, setting $\partial L/\partial\mu = 0$ and $\partial L/\partial\sigma^2 = 0$ gives
$$\mu_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{k=1}^{N}(x_k - \mu_{ML})^2.$$


Example: Univariate Normal


Taking expectations over datasets drawn from the true distribution:
$$E[\mu_{ML}] = \mu, \qquad E\big[\sigma^2_{ML}\big] = \frac{N-1}{N}\,\sigma^2.$$


Example: Univariate Normal



Thus $\sigma^2_{ML}$ is biased (although asymptotically unbiased).
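The bias is easy to see in simulation (a sketch with arbitrary true parameters): averaged over many datasets of size $N$, $\sigma^2_{ML}$ comes out near $\frac{N-1}{N}\sigma^2$ rather than $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true, N, trials = 4.0, 10, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
mu_ml = samples.mean(axis=1, keepdims=True)
sigma2_ml = ((samples - mu_ml) ** 2).mean(axis=1)   # ML variance estimate per dataset

print(sigma2_ml.mean())              # close to (N-1)/N * sigma2_true = 3.6
print((N - 1) / N * sigma2_true)     # 3.6
```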



Example: Multivariate Normal



Given i.i.d. data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the log likelihood function is given by
$$\ln p(X\,|\,\boldsymbol\mu,\boldsymbol\Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol\mu)^T\boldsymbol\Sigma^{-1}(\mathbf{x}_n-\boldsymbol\mu).$$


Maximum Likelihood for the Gaussian



Set the derivative of the log likelihood function with respect to $\boldsymbol\mu$ to zero,
$$\frac{\partial}{\partial\boldsymbol\mu}\ln p(X\,|\,\boldsymbol\mu,\boldsymbol\Sigma) = \sum_{n=1}^{N}\boldsymbol\Sigma^{-1}(\mathbf{x}_n - \boldsymbol\mu) = 0,$$
and solve to obtain
$$\boldsymbol\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n.$$
One can also show that
$$\boldsymbol\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol\mu_{ML})(\mathbf{x}_n - \boldsymbol\mu_{ML})^T.$$
Recall: if $\mathbf{x}$ and $\mathbf{a}$ are vectors, then $\dfrac{\partial}{\partial\mathbf{x}}\big(\mathbf{x}^T\mathbf{a}\big) = \dfrac{\partial}{\partial\mathbf{x}}\big(\mathbf{a}^T\mathbf{x}\big) = \mathbf{a}$.
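A short NumPy sketch of these two estimators (illustrative data only); note that np.cov(..., bias=True) uses the ML normalization $1/N$ rather than the usual $1/(N-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[2.0, 0.4],
                       [0.4, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)   # rows are x_n

mu_ml = X.mean(axis=0)                            # (1/N) sum_n x_n
diff = X - mu_ml
Sigma_ml = (diff.T @ diff) / len(X)               # (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)
print(Sigma_ml)
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True)))   # True
```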


Part 3.2: Bayesian Parameter Estimation

Bayesian Inference for the Gaussian (Univariate Case)



Assume $\sigma^2$ is known. Given i.i.d. data $X = \{x_1, \ldots, x_N\}$, the likelihood function for $\mu$ is given by
$$p(X\,|\,\mu) = \prod_{n=1}^{N} p(x_n\,|\,\mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right).$$
This has a Gaussian shape as a function of $\mu$ (but it is not a distribution over $\mu$).


Bayesian Inference for the Gaussian (Univariate Case)



Combined with a Gaussian prior over $\mu$,
$$p(\mu) = \mathcal{N}(\mu\,|\,\mu_0, \sigma_0^2),$$
this gives the posterior
$$p(\mu\,|\,X) \propto p(X\,|\,\mu)\,p(\mu).$$
Completing the square over $\mu$, we see that
$$p(\mu\,|\,X) = \mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2).$$


Bayesian Inference for the Gaussian



where
$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}.$$

Shortcut: $p(\mu\,|\,X)$ has the form $C\exp\!\big(-\tfrac{1}{2}\Delta^2\big)$. Get $\Delta^2$ into the form $a\mu^2 - 2b\mu + c = a(\mu - b/a)^2 + \text{const}$ and identify
$$\mu_N = b/a, \qquad \frac{1}{\sigma_N^2} = a.$$

Note: the posterior precision $1/\sigma_N^2 = 1/\sigma_0^2 + N/\sigma^2$ is the prior precision plus one data precision $1/\sigma^2$ per observation, so as $N \to \infty$, $\mu_N \to \mu_{ML}$ and $\sigma_N^2 \to 0$.
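A compact sketch of this update (illustrative prior and data, not the slide's), checking the closed-form $\mu_N$ and $\sigma_N^2$ against a brute-force normalization of likelihood times prior on a grid:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma0 = 0.0, 0.8              # prior N(mu0, sigma0^2)
sigma = 0.3                         # known observation noise std
x = rng.normal(1.0, sigma, size=20)
N, mu_ml = len(x), x.mean()

# Closed-form posterior parameters.
sigma_N2 = 1.0 / (1.0 / sigma0 ** 2 + N / sigma ** 2)
mu_N = sigma_N2 * (mu0 / sigma0 ** 2 + N * mu_ml / sigma ** 2)

# Brute-force check: unnormalized posterior on a grid.
grid = np.linspace(-2.0, 3.0, 20001)
log_post = norm.logpdf(grid, mu0, sigma0) + \
           norm.logpdf(x[:, None], grid, sigma).sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)

print(mu_N, np.trapz(grid * post, grid))                     # posterior means agree
print(sigma_N2, np.trapz((grid - mu_N) ** 2 * post, grid))   # posterior variances agree
```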

Bayesian Inference for the Gaussian



Example: prior and noise parameters $\mu_0 = 0$, $\sigma_0 = 0.8$, $\sigma^2 = 0.1$; the posterior $p(\mu\,|\,X)$ sharpens around the sample mean as $N$ grows.


Maximum a Posteriori (MAP) Estimation



In MAP estimation, we use the value of $\mu$ that maximizes the posterior $p(\mu\,|\,X)$:
$$\mu_{MAP} = \mu_N.$$


Full Bayesian Parameter Estimation



In both ML and MAP estimation, we use the training data $X$ to estimate a specific value for the unknown parameter vector $\theta$, and then use that value for subsequent inference on new observations $\mathbf{x}$: $p(\mathbf{x}\,|\,\theta)$.

These methods are suboptimal, because in fact we are always uncertain about the exact value of $\theta$, and to be optimal we should take into account the possibility that $\theta$ assumes other values.


Full Bayesian Parameter Estimation



In full Bayesian parameter estimation, we do not estimate a specific value for $\theta$. Instead, we compute the posterior over $\theta$, and then integrate it out when computing $p(\mathbf{x}\,|\,X)$:
$$p(\mathbf{x}\,|\,X) = \int p(\mathbf{x}\,|\,\theta)\,p(\theta\,|\,X)\,d\theta$$
$$p(\theta\,|\,X) = \frac{p(X\,|\,\theta)\,p(\theta)}{p(X)} = \frac{p(X\,|\,\theta)\,p(\theta)}{\int p(X\,|\,\theta)\,p(\theta)\,d\theta}$$
$$p(X\,|\,\theta) = \prod_{k=1}^{N} p(\mathbf{x}_k\,|\,\theta)$$


Example: Univariate Normal with Unknown Mean



Consider again the case $p(x) \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is known and $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.

We showed that $p(\mu\,|\,X) \sim \mathcal{N}(\mu_N, \sigma_N^2)$, where $\mu_N$ and $\sigma_N^2$ are given above.

In the MAP approach, we approximate
$$p(x\,|\,X) \approx \mathcal{N}(\mu_N, \sigma^2).$$
In the full Bayesian approach, we calculate
$$p(x\,|\,X) = \int p(x\,|\,\mu)\,p(\mu\,|\,X)\,d\mu,$$
which can be shown to yield
$$p(x\,|\,X) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2).$$


Comparison: MAP vs Full Bayesian Estimation



MAP: $p(x\,|\,X) \approx \mathcal{N}(\mu_N, \sigma^2)$

Full Bayesian: $p(x\,|\,X) = \mathcal{N}(\mu_N, \sigma^2 + \sigma_N^2)$

The higher (and more realistic) uncertainty in the full Bayesian approach reflects our posterior uncertainty about the exact value of the mean $\mu$.
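A tiny numeric illustration of the difference, using the example values $\mu_0 = 0$, $\sigma_0 = 0.8$, $\sigma^2 = 0.1$ from the earlier slide (a sketch, not from the original deck): with few observations the extra $\sigma_N^2$ term is noticeable, and it shrinks toward zero as $N$ grows.

```python
# MAP vs full Bayesian predictive variance as the dataset grows.
sigma2, sigma0_2 = 0.1, 0.8 ** 2   # observation variance and prior variance

for N in [1, 5, 50, 500]:
    sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # posterior variance of the mean
    print(f"N={N:4d}  MAP predictive var = {sigma2:.4f}   "
          f"full Bayesian predictive var = {sigma2 + sigma_N2:.4f}")
```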

