Based on:
1. J. C. Davis, Statistics and Data Analysis in Geology, 2nd ed., John Wiley & Sons, New York, 1996.
2. D. L. Massart, B. G. M. Vandeginste, S. N. Deming, Y. Michotte, and L. Kaufman, Chemometrics: A Textbook, Elsevier, Amsterdam, 1988.
3. B. Jørgensen, course notes for Multivariate Data Analysis and Chemometrics, Department of Statistics, University of Southern Denmark, 2003.
4. L. Eriksson, E. Johansson, N. Kettaneh-Wold, and S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, UMETRICS, 2001.
5. Matlab Online Manual, The MathWorks, Inc.
Scaling
(Adapted from L. Eriksson, E. Johansson, N. Kettaneh-Wold, and S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, UMETRICS, 2001)
1) Pre-treatment of the Data Matrix: Scaling. Unless the data are normalized, a variable with a large variance will dominate the analysis. The most common scaling technique is unit variance (UV) scaling.
In the data matrix each row is a sample and each column is a variable. UV scaling divides every element of a column by that column's standard deviation $s_k$:

$$a_{ik} \longrightarrow \frac{a_{ik}}{s_k}$$

so the first column $(a_{11}, \ldots, a_{m1})$, with standard deviation $s_1$, becomes $(a_{11}/s_1, \ldots, a_{m1}/s_1)$, and each column of the scaled matrix has variance 1.
Note: after UV scaling the mean values still differ between variables, so mean centering is applied as the second part of data pre-treatment. Step 1) the average value of each variable is calculated; Step 2) this average is subtracted from the data.
Mean Centering
Note: unit variance scaling + mean centering = auto-scaling. Sometimes UV scaling is not needed, e.g. when all variables share the same unit, as in spectroscopic data.
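The following is a minimal Matlab sketch of auto-scaling (not from the original notes; the data matrix here is illustrative):

```matlab
% Auto-scaling = mean centering + unit variance (UV) scaling.
% Rows are samples, columns are variables.
a   = rand(20, 2);                   % illustrative stand-in for a 20-by-2 data matrix
m   = size(a, 1);
avg = mean(a);                       % row vector of column averages
s   = std(a);                        % row vector of column standard deviations
X   = (a - repmat(avg, m, 1)) ./ repmat(s, m, 1);   % auto-scaled data
% Every column of X now has mean 0 and variance 1.
```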
We have 20 samples with 2 variables, so the size of the data matrix is 20×2.
[Figures: scatter plots of the 20 samples in the X1-X2 plane, before and after mean centering. The average point of the raw data, (Avg. of X1, Avg. of X2) = (1.9995, 1.618), is marked; after mean centering the same cloud is spread around the origin.]
1) Calculation of the covariance matrix (S) of the data matrix (X). If the data matrix X has m rows and n columns, the covariance matrix S is
$$S = \mathrm{cov}(X) = \frac{1}{m-1}\,X^{T}X \qquad \text{(for mean-centered } X\text{)}$$
From the example we have 20 rows and 2 columns, so by the definition above the covariance matrix S is 2 by 2.
How do we get the covariance matrix S on a computer? Use the command cov(a) in Matlab, where a is the data matrix.
In the resulting 2×2 matrix, the diagonal entries are the variances of X1 and X2 and the off-diagonal entries are their covariance.
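As a sketch (with illustrative data), the command and the formula above can be checked against each other:

```matlab
% Covariance matrix via the built-in command and via the definition.
a  = rand(20, 2);                  % illustrative 20-by-2 data matrix
S  = cov(a);                       % 2-by-2 covariance matrix
m  = size(a, 1);
X  = a - repmat(mean(a), m, 1);    % mean-centered data
S2 = X' * X / (m - 1);             % matches cov(a) up to round-off
```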
Variance (a one-dimensional concept): a measure of the spread of the data in a given data set.

$$\mathrm{var}(X) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})}{n-1}$$
Covariance (a multi-dimensional concept): a measure of how the data spread jointly between dimensions (variables).

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
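A brief sketch (with illustrative vectors) showing that the two summation formulas agree with Matlab's built-ins:

```matlab
% Variance and covariance computed directly from the summation formulas.
x   = rand(20, 1);  y = rand(20, 1);                  % illustrative data
n   = length(x);
vx  = sum((x - mean(x)).^2) / (n - 1);                % matches var(x)
cxy = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);  % matches the off-diagonal of cov(x, y)
```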
1. From the properties of eigenvalues we already know that for symmetric matrices the eigenvectors are always at right angles to each other: ORTHOGONAL!
2. Therefore the eigenvectors of a covariance matrix are orthogonal: a VERY IMPORTANT concept in PCA.
3. If we measure m variables, we can compute an m×m covariance matrix, and from it extract m eigenvalues and m eigenvectors.
From the example, the covariance matrix of the scaled data is

$$S = \begin{pmatrix} 0.6881 & 0.5929 \\ 0.5929 & 0.9026 \end{pmatrix}$$

Because S is symmetric, its eigenvectors are orthogonal (90°). Setting $\det(S - \lambda I) = 0$ gives the characteristic equation

$$\lambda^{2} - 1.5907\,\lambda + 0.27 = 0$$

whose roots are the eigenvalues of the covariance matrix S:

$$\lambda_1 = 1.3978, \qquad \lambda_2 = 0.1928$$

For $\lambda_1 = 1.3978$, the eigenvector equation $(S - \lambda_1 I)\,x = 0$ becomes

$$\begin{pmatrix} 0.6881 - 1.3978 & 0.5929 \\ 0.5929 & 0.9026 - 1.3978 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} -0.7097 & 0.5929 \\ 0.5929 & -0.4952 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 0$$

Is it possible to solve this system uniquely (apart from the trivial solution $x_1 = x_2 = 0$)? No: the matrix is singular, its two rows being proportional, so the two equations carry the same information and only the ratio $x_1 : x_2$ is determined. An eigenvector is therefore fixed only up to an arbitrary scale factor.
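These two steps can be sketched in Matlab as follows (the coefficients come from the worked example above):

```matlab
% Eigenvalues as the roots of the characteristic polynomial of S.
S      = [0.6881 0.5929; 0.5929 0.9026];
lambda = roots([1 -1.5907 0.27]);   % approx. 1.3978 and 0.1928
l1     = max(lambda);               % the largest eigenvalue, lambda_1
% Only the ratio x1:x2 is determined. From the first row of (S - l1*I)x = 0,
% (0.6881 - l1)*x1 + 0.5929*x2 = 0, so one valid direction is:
v1 = [0.5929; l1 - 0.6881];
v1 = v1 / norm(v1);                 % unit eigenvector, approx. (0.6411, 0.7675)
```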
2) Calculation of the corresponding eigenvectors of the covariance matrix (S). In Matlab, use the eigenvector decomposition function: the command is [P, L] = eig(S), where the columns of P are the eigenvectors and L is the diagonal matrix of eigenvalues. You will then get the following in Matlab:
$$\Lambda = \begin{pmatrix} 0.1928 & 0 \\ 0 & 1.3978 \end{pmatrix}$$

(For a symmetric matrix, eig returns the eigenvalues in ascending order.)
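A sketch of the call and its output (eigenvector signs are arbitrary and may differ between Matlab versions):

```matlab
% Full eigendecomposition of the covariance matrix.
S      = [0.6881 0.5929; 0.5929 0.9026];
[P, L] = eig(S);   % columns of P = unit eigenvectors, diag(L) = eigenvalues
% diag(L) is approx. [0.1928; 1.3978]. The column of P paired with 1.3978
% is approx. +/-(0.6411, 0.7675); the one paired with 0.1928 is
% approx. +/-(0.7675, -0.6411).
```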
For $\lambda_1 = 1.3978$, a unit eigenvector is $(x_1, x_2) = (0.6411,\ 0.7675)$; because an eigenvector is fixed only up to scale, $(-0.6411,\ -0.7675)$ is equally valid. For $\lambda_2 = 0.1928$, a unit eigenvector is $(0.7675,\ -0.6411)$, or equivalently $(-0.7675,\ 0.6411)$. Either sign choice may appear in software output.
3) Graphical representation of the eigenvalues and eigenvectors. For the covariance matrix S of the scaled data matrix, the eigenvalues are $\lambda_1 = 1.3978$ and $\lambda_2 = 0.1928$; the eigenvector for $\lambda_1$ is $(0.6411,\ 0.7675)$ (or its negative), and the eigenvector for $\lambda_2$ is $(0.7675,\ -0.6411)$ (or its negative). We already know that the eigenvectors give the orientations of the principal axes of the ellipse describing the data cloud. Slope of the major axis = ratio of the eigenvector components = $0.7675/0.6411 \approx +1.197$ (positive slope); slope of the minor axis = $-0.6411/0.7675 \approx -0.835$ (negative slope).
[Figure: the mean-centered data in the X1-X2 plane with the enclosing variance ellipse. The major axis, PC1 (Principal Component 1), corresponds to the eigenvalue $\lambda_1 = 1.3978$ and the minor axis to $\lambda_2 = 0.1928$; the two axes are orthogonal (90°). The slope of the major axis is the ratio of the components of the eigenvector of the largest eigenvalue ($\lambda_1$). Cf. in 3D: the ellipse becomes an ellipsoid.]
Recall the eigenvalue matrix returned by Matlab:

$$\Lambda = \begin{pmatrix} 0.1928 & 0 \\ 0 & 1.3978 \end{pmatrix}$$
How about the eigenvalues?
1. The eigenvalues total 0.1928 + 1.3978 = 1.5906, which is the same as the total variance.
2. The eigenvalues measure the spread of the data along the two principal axes of the ellipse (the axis lengths are proportional to their square roots), so together the axes represent the total variance of the data set.
3. The first principal axis contains 1.3978/1.5906 = 87.88% of the total variance; the second principal axis represents 0.1928/1.5906 = 12.12% of the total variance.
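A quick sketch of this bookkeeping in Matlab:

```matlab
% Percentage of the total variance captured by each principal axis.
S      = [0.6881 0.5929; 0.5929 0.9026];
lambda = sort(eig(S), 'descend');      % approx. [1.3978; 0.1928]
pct    = 100 * lambda / sum(lambda);   % approx. [87.88; 12.12]
% sum(lambda) equals trace(S), the total variance (~1.5907).
```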
5) Advantages of PCA. From this example: let's suppose we need to reduce our system to only one variable. Then we must discard either variable X2 or X1, which means losing 56.74% or 43.26% of the total variance, respectively. If, however, we convert our data set to scores on the first principal axis (PC1), we lose only 12.12% of the variation in our data set. This is a big advantage of PCA!
1) Mathematical representation of the transformation of axes from X1-X2 to PC1-PC2 (PC: Principal Component)
[Figure: the mean-centered data with the PC1 axis drawn through the origin; the slope of the major axis is the ratio of the eigenvector components of the largest eigenvalue ($\lambda_1$).]
Each principal component is a linear combination of the original variables, $\mathrm{PC1} = p_1 X_1 + p_2 X_2$, where the coefficients are the eigenvector components. From the example:

$$\mathrm{PC1}_i = 0.6411\,X_{1i} + 0.7675\,X_{2i}$$
$$\mathrm{PC2}_i = -0.7675\,X_{1i} + 0.6411\,X_{2i}$$

The values $\mathrm{PC1}_i$ and $\mathrm{PC2}_i$ are the scores; the coefficients are the loadings. (The overall sign of each component is arbitrary.)
2) Calculation of scores
$$[X]\,([P]^{T})^{-1} = [T]$$

Because P is orthonormal, $(P^{T})^{-1} = P$, so the scores are obtained simply as $T = XP$. The entries of T are the scores (the PC1 and PC2 coordinates of each sample) and the entries of P are the loadings. Therefore, for the mean-centered data matrix,

$$\underbrace{\begin{pmatrix} x_{11} & x_{12} \\ \vdots & \vdots \\ x_{m1} & x_{m2} \end{pmatrix}}_{X}\;\underbrace{\begin{pmatrix} 0.6411 & -0.7675 \\ 0.7675 & 0.6411 \end{pmatrix}}_{P} = \underbrace{\begin{pmatrix} t_{11} & t_{12} \\ \vdots & \vdots \\ t_{m1} & t_{m2} \end{pmatrix}}_{T}$$

where the first column of T holds the PC1 scores and the second column the PC2 scores.
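The whole score calculation can be sketched compactly (illustrative data; component signs may flip):

```matlab
% Scores: project the mean-centered data onto the principal axes.
a      = rand(20, 2);                   % illustrative stand-in for the raw data
m      = size(a, 1);
X      = a - repmat(mean(a), m, 1);     % mean-centered data matrix
[P, L] = eig(cov(X));                   % loadings P, eigenvalues on diag(L)
P      = fliplr(P);                     % reorder so PC1 (largest eigenvalue) is first
T      = X * P;                         % scores: column 1 = PC1, column 2 = PC2
```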
[Figure: the rotation of axes from X1-X2 to PC1-PC2. Plotting the scores shows the same data cloud referred to the new, rotated coordinate system, with PC1 along the direction of largest spread.]
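Continuing the snippet above, the rotation can be visualized by plotting the centered data next to their scores (a sketch, assuming X and T from the previous block):

```matlab
% Visualize the rotation of axes from X1-X2 to PC1-PC2.
subplot(1, 2, 1);
plot(X(:, 1), X(:, 2), 'o');  axis equal;  grid on;
xlabel('X1');  ylabel('X2');  title('Mean-centered data');
subplot(1, 2, 2);
plot(T(:, 1), T(:, 2), 'o');  axis equal;  grid on;
xlabel('PC1');  ylabel('PC2');  title('Scores (rotated axes)');
```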