Linear Algebra

Tutorial this Wed 3:00-4:30 in Bethune 228.

Linear Algebra Reviews:
- Kolter, Z., available at http://cs229.stanford.edu/section/cs229-linalg.pdf
- Prince, Appendix C (up to and including C.7.1)
- Bishop, Appendix C
- Roweis, S., available at http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf
Credits
- Bishop, Microsoft UK
- Simon Prince, University College London
- Sergios Theodoridis, University of Athens & Konstantinos Koutroumbas, National Observatory of Athens
1. The Multivariate Normal Distribution
2. Decision Boundaries in Higher Dimensions
3. Parameter Estimation
Orthonormal Form

Let $u_i$ and $\lambda_i$ represent the $i$th eigenvector/eigenvalue pair of $\Sigma$:
$$\Sigma u_i = \lambda_i u_i.$$
See the Linear Algebra Review Resources on the Moodle site for a review of eigenvectors.
Orthonormal Form

Since it is used in a quadratic form, we can assume that $\Sigma^{-1}$ is symmetric. This means that all of its eigenvalues and eigenvectors are real.

We are also implicitly assuming that $\Sigma$, and hence $\Sigma^{-1}$, are invertible (of full rank).

Thus $\Sigma$ can be represented in orthonormal form: $\Sigma = U \Lambda U^t$, where the columns of $U$ are the eigenvectors $u_i$ of $\Sigma$, and $\Lambda$ is the diagonal matrix with entries $\Lambda_{ii} = \lambda_i$ equal to the corresponding eigenvalues of $\Sigma$.

Thus the squared Mahalanobis distance $\Delta^2$ can be represented as:
$$\Delta^2 = (x - \mu)^t \Sigma^{-1} (x - \mu) = (x - \mu)^t U \Lambda^{-1} U^t (x - \mu).$$
Let $y = U^t(x - \mu)$. Then we have
$$\Delta^2 = y^t \Lambda^{-1} y = \sum_{ij} y_i \Lambda^{-1}_{ij} y_j = \sum_i \lambda_i^{-1} y_i^2,$$
where $y_i = u_i^t (x - \mu)$.
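To make the change of coordinates concrete, here is a minimal NumPy sketch (illustrative, not from the slides; the covariance and test point are made up). It verifies that the Mahalanobis distance computed directly equals the eigenvalue-weighted sum of squared rotated coordinates:

```python
import numpy as np

# Made-up symmetric, positive definite covariance and mean
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
mu = np.array([1.0, 2.0])
x = np.array([3.0, 1.0])

# Orthonormal form: Sigma = U diag(lam) U^t (eigh handles symmetric matrices)
lam, U = np.linalg.eigh(Sigma)

# Rotated coordinates y = U^t (x - mu)
y = U.T @ (x - mu)

# Squared Mahalanobis distance, two ways
d2_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
d2_eigen = np.sum(y**2 / lam)    # sum_i lambda_i^{-1} y_i^2

assert np.isclose(d2_direct, d2_eigen)
```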
[Figure: Gaussian model of faces and non-faces in pixel space; axes Pixel 1 and Pixel 2.]
Model 1 Results

Results based on 200 cropped faces and 200 non-faces from the same database.

[ROC curve: Pr(Hit) vs. Pr(False Alarm).]
[Figure: Model 2 in pixel space, with separate means ($\mu$ face, $\mu$ non-face) and covariances ($\sigma$ face, $\sigma$ non-face); axes Pixel 1 and Pixel 2.]
Model 2 Results

Results based on 200 cropped faces and 200 non-faces from the same database. The more sophisticated model unsurprisingly classifies new faces and non-faces better.

[ROC curves: Pr(Hit) vs. Pr(False Alarm), comparing Diagonal and Uniform covariance models.]
PROBLEM: we cannot fit this model. We don't have enough data to estimate the full covariance matrix: N = 400 training images, D = 10,800 dimensions.

Total number of measured numbers = ND = 400 × 10,800 = 4,320,000.
Total number of parameters in the covariance matrix = D(D+1)/2 = (10,800+1) × 10,800/2 = 58,325,400.
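As a quick check of these counts (a trivial sketch, numbers taken from the slide):

```python
# Parameter counting for a D-dimensional Gaussian with full covariance
N, D = 400, 10_800

measured = N * D                  # total measured numbers: 4,320,000
cov_params = D * (D + 1) // 2     # free parameters in a symmetric covariance: 58,325,400

print(measured, cov_params)
```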
1. The Multivariate Normal Distribution
2. Decision Boundaries in Higher Dimensions
3. Parameter Estimation
Decision Surfaces

Consider two contiguous decision regions,
$$R_i: P(\omega_i \mid x) > P(\omega_j \mid x) \quad \text{and} \quad R_j: P(\omega_j \mid x) > P(\omega_i \mid x).$$
The surface
$$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$$
separates the two decision regions; $g(x)$ is positive on one side and negative on the other.
Discriminant Functions

In general, discriminant functions can be defined in other ways, independent of Bayes. In theory this will lead to a suboptimal solution. However, non-Bayesian classifiers can have significant advantages:
- Often a full Bayesian treatment is intractable or computationally prohibitive.
- Approximations made in a Bayesian treatment may lead to errors avoided by non-Bayesian methods.
The multivariate normal class conditionals are
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right),$$
where
$$\mu_i = E[x \mid \omega_i] \quad \text{and} \quad \Sigma_i = E\left[(x - \mu_i)(x - \mu_i)^T \mid \omega_i\right].$$
For these Gaussian class conditionals, it is convenient to use the log discriminant
$$g_i(x) = \ln\left(p(x \mid \omega_i)\, P(\omega_i)\right) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
$$= -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i,$$
where
$$C_i = -\frac{D}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|.$$
Quadratic Classifiers

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$$

Thus the decision surface has a quadratic form. For a 2D input space, the decision curves are quadrics (ellipses, parabolas, hyperbolas or, in degenerate cases, lines).

[Figure: Gaussian likelihoods over a 2D input space.]
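A minimal sketch of this quadratic discriminant (illustrative, not from the slides; the means, covariances, and priors below are made up):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) + ln P(w_i) + C_i"""
    D = len(mu)
    diff = x - mu
    C = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return -0.5 * diff @ np.linalg.inv(Sigma) @ diff + np.log(prior) + C

# Hypothetical two-class problem in 2D
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 0.5])
g = [quadratic_discriminant(x, mus[i], Sigmas[i], priors[i]) for i in range(2)]
print(np.argmax(g))   # assign x to the class with the larger discriminant
```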
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$$

Suppose that the two likelihoods are both isotropic ($\Sigma_i = \sigma_i^2 I$), but with different means and variances. Then
$$g_i(x) = -\frac{1}{2\sigma_i^2}\|x - \mu_i\|^2 + \ln P(\omega_i) + C_i.$$
Since the quadratic terms $-\frac{1}{2\sigma_i^2} x^T x$ enter with different coefficients for the two classes, they do not cancel, and the decision curve remains a quadric.
Equal Covariances

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$$

If the covariance matrices of the two likelihoods are identical ($\Sigma_i = \Sigma_j = \Sigma$), the quadratic term $-\frac{1}{2}x^T \Sigma^{-1} x$ is common to both discriminants and cancels in their comparison. Thus the decision boundary is linear.
Linear Classifier

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$$

In this case, we can drop the quadratic terms and express the discriminant function in linear form:
$$g_i(x) = w_i^T x + w_{i0},$$
$$w_i = \Sigma^{-1} \mu_i,$$
$$w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i.$$
Decision Boundary

With
$$g_i(x) = w_i^T x + w_{i0}, \quad w_i = \Sigma^{-1}\mu_i, \quad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i,$$
the boundary between classes $i$ and $j$ is
$$g_{ij}(x) = w^T(x - x_0) = 0, \quad \text{where}$$
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$
and
$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2_{\Sigma^{-1}}},$$
where $\|x\|_{\Sigma^{-1}} \equiv (x^T \Sigma^{-1} x)^{1/2}$.
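A small numerical sketch of these boundary formulas (all parameter values are made-up assumptions):

```python
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])     # shared covariance (assumed)
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
P_i, P_j = 0.6, 0.4                # class priors (assumed)

Sigma_inv = np.linalg.inv(Sigma)
d = mu_i - mu_j
w = Sigma_inv @ d

# ||mu_i - mu_j||^2 in the Sigma^{-1} norm
norm2 = d @ Sigma_inv @ d

x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) * d / norm2

def g_ij(x):
    return w @ (x - x0)   # > 0: decide class i; < 0: decide class j

print(g_ij(np.array([1.0, 0.5])))
```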
If the two likelihoods have identical covariance AND the two classes are equiprobable, the discriminant function simplifies from
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln P(\omega_i) + C_i$$
to
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i).$$
Isotropic Case

If in addition the covariance is isotropic, the Bayesian classifier simply assigns the class that minimizes the Euclidean distance $d_e$ between the observed feature vector and the class mean:
$$d_e = \|x - \mu_i\|.$$
To deal with anisotropic distributions, we simply classify according to the Mahalanobis distance, defined as
$$\Delta = \left((x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right)^{1/2}.$$
In the eigenvector coordinate frame, $\Delta^2 = \sum_i \lambda_i^{-1} y_i^2$, where $y_i = u_i^t (x - \mu_i)$.

Thus the curves of constant Mahalanobis distance $c$ have ellipsoidal form.
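A minimal minimum-Mahalanobis-distance classifier along these lines (illustrative; the class means and shared covariance are assumed):

```python
import numpy as np

def mahalanobis(x, mu, Sigma_inv):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# Hypothetical class means and shared covariance
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigma_inv = np.linalg.inv(np.array([[2.0, 0.8],
                                    [0.8, 1.0]]))

x = np.array([1.5, 0.2])
dists = [mahalanobis(x, mu, Sigma_inv) for mu in mus]
print(np.argmin(dists))   # assign the class whose mean is nearest in Mahalanobis distance
```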
1. The Multivariate Normal Distribution
2. Decision Boundaries in Higher Dimensions
3. Parameter Estimation
Suppose we believe input vectors $x$ are distributed as $p(x) \equiv p(x; \theta)$, where $\theta$ is an unknown parameter vector. Given independent training input vectors $X = \{x_1, x_2, \ldots, x_N\}$, we want to compute the maximum likelihood estimate $\theta_{ML}$ for $\theta$. Since the input vectors are independent, we have
$$p(X; \theta) = \prod_{k=1}^{N} p(x_k; \theta).$$
Define the log likelihood $L(\theta) = \ln p(X; \theta)$. The general method is to take the derivative of $L$ with respect to $\theta$, set it to 0, and solve for $\theta$:
$$\theta_{ML}: \quad \frac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \frac{\partial \ln p(x_k; \theta)}{\partial \theta} = 0.$$
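For example, for a 1D Gaussian with known variance and unknown mean, setting this derivative to zero yields the sample mean. A quick numerical check (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0, sigma = 2.0, 1.0                   # true mean and known std dev (made up)
X = rng.normal(theta_0, sigma, size=1000)   # i.i.d. training inputs

# d/dtheta sum_k ln p(x_k; theta) = sum_k (x_k - theta) / sigma^2 = 0
# => theta_ML = sample mean
theta_ml = X.mean()
print(theta_ml)   # close to theta_0 = 2.0
```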
Let $\theta_0$ be the true value of the unknown parameter vector. Then $\theta_{ML}$ is asymptotically unbiased,
$$\lim_{N \to \infty} E[\theta_{ML}] = \theta_0,$$
and asymptotically consistent,
$$\lim_{N \to \infty} E\left[\|\theta_{ML} - \theta_0\|^2\right] = 0.$$
1. The Multivariate Normal Distribution
2. Decision Boundaries in Higher Dimensions
3. Parameter Estimation
Assume $\sigma^2$ is known. Given i.i.d. data $X = \{x_1, \ldots, x_N\}$, the likelihood function for $\mu$ is given by
$$p(X \mid \mu) = \prod_{k=1}^{N} p(x_k \mid \mu) = \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_k - \mu)^2}{2\sigma^2}\right).$$
Assume a Gaussian prior over the mean, $p(\mu) = N(\mu;\, \mu_0, \sigma_0^2)$. The posterior is then
$$p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu).$$
The posterior is Gaussian: $p(\mu \mid X) = N(\mu;\, \mu_N, \sigma_N^2)$.

Shortcut: $p(\mu \mid X)$ has the form $C \exp(-\epsilon^2)$. Get $\epsilon^2$ in the form $a\mu^2 - 2b\mu + c = a(\mu - b/a)^2 + \mathrm{const}$ and identify
$$\mu_N = b/a,$$
$$\frac{1}{2\sigma_N^2} = a.$$
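Carrying this shortcut through for the Gaussian likelihood and prior above gives the standard conjugate results (a sketch of the algebra):
$$\epsilon^2 = \sum_{k=1}^{N} \frac{(x_k - \mu)^2}{2\sigma^2} + \frac{(\mu - \mu_0)^2}{2\sigma_0^2} = a\mu^2 - 2b\mu + c,$$
with
$$a = \frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2}, \qquad b = \frac{\sum_k x_k}{2\sigma^2} + \frac{\mu_0}{2\sigma_0^2},$$
so that
$$\mu_N = \frac{b}{a} = \frac{\sigma_0^2 \sum_k x_k + \sigma^2 \mu_0}{N\sigma_0^2 + \sigma^2}, \qquad \frac{1}{\sigma_N^2} = 2a = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}.$$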
Example: $\mu_0 = 0$, $\sigma_0 = 0.8$, $\sigma^2 = 0.1$.
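A small sketch of this example in code (the observations are hypothetical draws):

```python
import numpy as np

mu0, sigma0 = 0.0, 0.8    # prior mean and std dev (from the example)
sigma2 = 0.1              # known observation variance (from the example)

rng = np.random.default_rng(1)
X = rng.normal(0.5, np.sqrt(sigma2), size=10)   # hypothetical data
N = len(X)

# Conjugate-Gaussian posterior over the mean
var_N = 1.0 / (N / sigma2 + 1.0 / sigma0**2)
mu_N = var_N * (X.sum() / sigma2 + mu0 / sigma0**2)

print(mu_N, np.sqrt(var_N))   # the posterior narrows around the sample mean as N grows
```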
Since the posterior is Gaussian, its mode coincides with its mean, so $\mu_{MAP} = \mu_N$.
In both ML and MAP, we use the training data $X$ to estimate a specific value for the unknown parameter vector $\theta$, and then use that value for subsequent inference on new observations $x$: $p(x \mid \theta)$.

These methods are suboptimal because in fact we are always uncertain about the exact value of $\theta$, and to be optimal we should take into account the possibility that $\theta$ assumes other values.
In full Bayesian parameter estimation, we do not estimate a specific value for $\theta$. Instead, we compute the posterior over $\theta$, and then integrate it out when computing $p(x \mid X)$:
$$p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta,$$
$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta},$$
$$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta).$$
For the Gaussian example above:
- Using the MAP estimate: $p(x \mid X) \sim N(\mu_N, \sigma^2)$.
- Using the full Bayesian approach: $p(x \mid X) \sim N(\mu_N, \sigma^2 + \sigma_N^2)$.

The higher (and more realistic) uncertainty in the full Bayesian approach reflects our posterior uncertainty about the exact value of the mean $\mu$.
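The full Bayesian form follows by integrating the Gaussian likelihood against the Gaussian posterior (a standard identity, sketched here):
$$p(x \mid X) = \int N(x;\, \mu, \sigma^2)\, N(\mu;\, \mu_N, \sigma_N^2)\, d\mu = N(x;\, \mu_N,\, \sigma^2 + \sigma_N^2).$$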