Fisher's solution
Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter.
For each class we define the scatter, an equivalent of the variance, as
$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$
where the quantity $(\tilde{s}_1^2 + \tilde{s}_2^2)$ is called the within-class scatter of the projected examples.
The Fisher linear discriminant is then the linear function $w^T x$ that maximizes the criterion
$J(w) = \dfrac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
To find the optimum projection we first define the scatter matrices in feature space,
$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$ and $S_W = S_1 + S_2$
where $S_W$ is called the within-class scatter matrix.
The scatter of the projection can then be expressed as a function of the scatter matrix in feature space:
$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$
$\Rightarrow \tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$
Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:
$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$
The matrix $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one.
We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as
$J(w) = \dfrac{w^T S_B w}{w^T S_W w}$
To find the maximum of $J(w)$, we take the derivative with respect to $w$ and set it to zero:
$\dfrac{d}{dw}\left[\dfrac{w^T S_B w}{w^T S_W w}\right] = 0$
$\Rightarrow (w^T S_W w)\, S_B w - (w^T S_B w)\, S_W w = 0$
Dividing by $w^T S_W w$:
$S_B w - J(w)\, S_W w = 0$
$\Rightarrow S_W^{-1} S_B w - J(w)\, w = 0$
Solving the generalized eigenvalue problem ($S_W^{-1} S_B w = J w$) yields
$w^* = \arg\max_w \dfrac{w^T S_B w}{w^T S_W w} = S_W^{-1} (\mu_1 - \mu_2)$
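For concreteness, here is a minimal NumPy sketch of this closed-form solution (an illustration, not part of the original slides; it assumes $S_W$ is nonsingular, and the function name is my own):

```python
import numpy as np

def fisher_lda_2class(X1, X2):
    """Two-class Fisher LDA: returns the direction w* = S_W^{-1} (mu1 - mu2).

    X1, X2: arrays of shape (N_i, d) holding the examples of each class.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_i = sum over class i of (x - mu_i)(x - mu_i)^T;
    # a constant normalization (e.g., 1/N_i) would not change the direction of w.
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # Closed-form solution of the generalized eigenvalue problem
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)
```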
Example
Compute the LDA projection for the following two-dimensional dataset:
$X_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$
$X_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
[Figure: the two classes plotted in the $(x_1, x_2)$ plane over the range 0-10, together with the resulting direction $w_{LDA}$.]
SOLUTION (by hand)
The class means are $\mu_1 = [3.0,\ 3.6]^T$ and $\mu_2 = [8.4,\ 7.6]^T$, and the class scatter matrices are
$S_1 = \begin{bmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{bmatrix}$; $S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$
The within-class and between-class scatter are therefore
$S_W = S_1 + S_2 = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$; $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{bmatrix} 29.16 & 21.60 \\ 21.60 & 16.00 \end{bmatrix}$
The LDA projection is then obtained as the solution of the generalized eigenvalue problem
$S_W^{-1} S_B w = \lambda w \Rightarrow \left| S_W^{-1} S_B - \lambda I \right| = \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \Rightarrow \lambda = 15.65$
$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65 \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \Rightarrow w^* = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$
Or directly by
$w^* = S_W^{-1} (\mu_1 - \mu_2) \propto \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$
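The arithmetic above can be checked in a few lines of NumPy (a verification sketch, not part of the slides; the expected values are shown in the comments):

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Class scatter matrices, normalized by N_i as in the example above
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)  # [[0.80, -0.40], [-0.40, 2.64]]
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)  # [[1.84, -0.04], [-0.04, 2.64]]
Sw = S1 + S2                              # [[2.64, -0.44], [-0.44, 5.28]]
Sb = np.outer(mu1 - mu2, mu1 - mu2)       # [[29.16, 21.6], [21.6, 16.0]]

M = np.linalg.solve(Sw, Sb)               # S_W^{-1} S_B = [[11.89, 8.81], [5.08, 3.76]]
evals, evecs = np.linalg.eig(M)
w = evecs[:, np.argmax(evals)]            # eigenvalue 15.65, w = +/-[0.91, 0.39]
```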
LDA, C classes
Fisher's LDA generalizes gracefully to C-class problems.
Instead of one projection $y$, we now seek $(C-1)$ projections $[y_1, y_2, \ldots, y_{C-1}]$ by means of $(C-1)$ projection vectors $w_i$, arranged by columns into a projection matrix $W = [w_1 | w_2 | \cdots | w_{C-1}]$:
$y_i = w_i^T x \;\Rightarrow\; y = W^T x$
Derivation
The within-class scatter generalizes as
$S_W = \sum_{i=1}^{C} S_i$, where $S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$ and $\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$
The between-class scatter becomes
$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu = \frac{1}{N} \sum_{\forall x} x$
[Figure: three classes in the $(x_1, x_2)$ plane with their within-class scatters $S_{W1}$, $S_{W2}$, $S_{W3}$ and between-class scatters $S_{B1}$, $S_{B2}$, $S_{B3}$.]
Similarly, we define the mean vectors and scatter matrices for the projected samples as
$\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y$; $\tilde{\mu} = \frac{1}{N} \sum_{\forall y} y$
$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T$; $\tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$
From our derivation for the two-class problem, it follows that
$\tilde{S}_W = W^T S_W W$; $\tilde{S}_B = W^T S_B W$
Since the projection is no longer one-dimensional, we use the determinants of the scatter matrices to obtain a scalar objective function
$J(W) = \dfrac{|\tilde{S}_B|}{|\tilde{S}_W|} = \dfrac{|W^T S_B W|}{|W^T S_W W|}$
And we will seek the projection matrix $W^*$ that maximizes this ratio.
It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
$W^* = \arg\max_W \dfrac{|W^T S_B W|}{|W^T S_W W|} \;\Rightarrow\; (S_B - \lambda_i S_W)\, w_i^* = 0$
NOTES
$S_B$ is the sum of $C$ matrices of rank one or less, and the mean vectors are constrained by $\mu = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$. Therefore, $S_B$ has rank $(C-1)$ or less, so at most $(C-1)$ of the eigenvalues $\lambda_i$ are non-zero, and LDA yields at most $(C-1)$ feature projections.
LDA can be derived as the Maximum Likelihood method for the case of
normal class-conditional densities with equal covariance matrices
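As an illustration (not from the slides), a compact NumPy/SciPy sketch of the C-class solution via this generalized eigenproblem; it assumes $S_W$ is nonsingular, and the function name lda_projection is my own:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components=None):
    """Multi-class LDA via the generalized eigenproblem S_B w = lambda S_W w.

    X: (N, d) data matrix; y: (N,) integer class labels.
    """
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    # Symmetric-definite generalized eigenproblem; eigh sorts eigenvalues ascending
    evals, evecs = eigh(Sb, Sw)
    if n_components is None:
        n_components = len(classes) - 1   # S_B has rank at most C-1
    W = evecs[:, ::-1][:, :n_components]  # top eigenvectors as columns of W
    return X @ W, W
```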
Results
[Figure: gas-sensor responses to five coffee odors (Sulawesy, Kenya, Arabian, Sumatra, Colombia): the raw sensor response versus time, and the data, labeled 1-5 by class, projected onto the first three PCA axes and onto the first three LDA axes.]
Limitations of LDA
LDA produces at most $(C-1)$ feature projections.
If the classification error estimates establish that more features are needed,
some other method must be employed to provide those additional features
LDA is a parametric method, since it assumes unimodal Gaussian likelihoods; if the distributions are significantly non-Gaussian, the LDA projections will not preserve any complex structure in the data.
LDA will also fail when the discriminatory information is not in the means but in the variance of the data.
[Figure: two classes $\omega_1$ and $\omega_2$ with equal means $\mu_1 = \mu_2 = \mu$ in the $(x_1, x_2)$ plane; the resulting LD axis carries no discriminatory information.]
Variants of LDA
Non-parametric LDA (Fukunaga)
NPLDA relaxes the unimodal Gaussian assumption by computing the between-class scatter matrix $S_B$ from local information via the k-nearest-neighbors rule; the resulting $S_B$ is full-rank, which allows more than $(C-1)$ features to be extracted.
Orthonormal LDA (Okada and Tomita)
OLDA computes projections that maximize the Fisher criterion and, at the same time, are pairwise orthonormal.
The method used in OLDA combines the eigenvalue solution of $S_W^{-1} S_B$ and the Gram-Schmidt orthonormalization procedure, as sketched below.
OLDA sequentially finds axes that maximize the Fisher criterion in the subspace orthogonal to all features already extracted.
OLDA is also capable of finding more than $(C-1)$ features.
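A schematic sketch of this sequential idea (my own illustration, not the authors' exact algorithm; it assumes the scatter matrices Sb and Sw defined earlier, with Sw nonsingular):

```python
import numpy as np
from scipy.linalg import eigh, null_space

def olda_axes(Sb, Sw, n_axes):
    """Sequential, orthonormal Fisher axes (Gram-Schmidt-style deflation)."""
    d = Sb.shape[0]
    W = np.zeros((d, 0))
    for _ in range(n_axes):
        # Orthonormal basis B of the subspace orthogonal to the axes found so far
        B = null_space(W.T) if W.shape[1] else np.eye(d)
        # Maximize the Fisher criterion restricted to span(B)
        evals, evecs = eigh(B.T @ Sb @ B, B.T @ Sw @ B)
        w = B @ evecs[:, -1]              # top eigenvector, mapped back to R^d
        W = np.column_stack([W, w / np.linalg.norm(w)])
    return W
```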
Multilayer perceptrons (Webb and Lowe)
It has been shown that the hidden layers of multi-layer perceptrons perform non-linear discriminant analysis by maximizing $\mathrm{Tr}[S_B S_T^{\dagger}]$, where $S_T^{\dagger}$ is the pseudo-inverse of the total scatter and the scatter matrices are measured at the output of the last hidden layer.
Exploratory Projection Pursuit
In other words, EPP seeks projections that separate clusters as much as possible while keeping each cluster compact, a criterion similar to Fisher's, except that EPP does NOT use class labels.
Once an interesting projection is found, it is important to remove the structure it reveals, so that other interesting views can be found more easily.
[Figure: two projections of the same dataset onto the $(x_1, x_2)$ plane, one labeled "Uninteresting" and one labeled "Interesting".]
Sammon's non-linear mapping
This method seeks a mapping onto a lower-dimensional space that preserves, as closely as possible, the inter-point distances of the original space: $d(P_i', P_j') \cong d(P_i, P_j)\ \forall\, i, j$, where $P_i'$ denotes the image of point $P_i$.
The original method did not obtain an explicit mapping but only a lookup table for
the elements in the training set
Newer implementations based on neural networks do provide an explicit mapping for
test data and also consider cost functions (e.g., Neuroscale)
[Figure: points $P_i$, $P_j$ in the original three-dimensional space $(x_1, x_2, x_3)$ mapped to $P_i'$, $P_j'$ in a two-dimensional space $(x_1, x_2)$, subject to $d(P_i', P_j') = d(P_i, P_j)\ \forall\, i, j$.]
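The lookup-table flavor of the method can be sketched as direct numerical minimization of Sammon's stress over the projected coordinates (an illustrative sketch, not the original iterative scheme; names are mine, and it assumes no duplicate points in X):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.optimize import minimize

def sammon(X, n_dims=2, seed=0):
    """Embed X into n_dims dimensions by minimizing Sammon's stress."""
    D = pdist(X)                          # original distances d(P_i, P_j)
    c = D.sum()
    rng = np.random.default_rng(seed)
    Y0 = 1e-2 * rng.standard_normal((len(X), n_dims))

    def stress(y):
        dp = pdist(y.reshape(len(X), n_dims))  # projected distances
        dp = np.maximum(dp, 1e-12)             # guard against zero distance
        return np.sum((D - dp) ** 2 / D) / c   # Sammon's stress

    res = minimize(stress, Y0.ravel(), method="L-BFGS-B")
    return res.x.reshape(len(X), n_dims)
```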