Chapter 250
Principal Components Analysis
Introduction
This chapter describes how to obtain a principal components analysis (PCA) for a subset of the genes using the GESS: Principal Components Analysis procedure. Either a correlation or a
variance-covariance matrix may be used. Options are included for varimax factor rotation,
quartimax factor rotation, or no factor rotation.
Before running this procedure, output (.ges) files containing a single expression value for each
gene on each array must be obtained using the appropriate pre-processing procedure in GESS.
Background
The general purpose of PCA is reduction of dimensionality. In microarray studies, the expression
of a large number of genes is obtained for several individuals or experimental units. Each gene
may be thought of as a factor, variable, or component of the individuals. If only two genes were
measured, a simple scatter plot of the values of each individual would show the relationship of
these two factors (genes). However, when larger numbers of genes are measured, visualization of
the resulting multidimensional space is impossible. Principal components analysis is a method for
reducing the multidimensional space down to only a few dimensions.
Principal components are produced by creating a new coordinate system or space for the factors
(genes). A new set of uncorrelated factors is produced. These factors are ordered so that the first
few retain most of the variation present in all of the original gene factors.
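This construction can be sketched in a few lines of Python (illustrative only, using a small hypothetical expression matrix rather than GESS data):

```python
import numpy as np

# Hypothetical data: 6 arrays (rows) by 4 genes (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Standardize each gene and diagonalize the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]      # re-sort, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The new, uncorrelated factors: project the standardized data onto the
# first two eigenvectors, reducing each array to two coordinates.
scores = Z @ eigvecs[:, :2]
print(scores.shape)   # (6, 2)
```

The first few columns of `scores` retain most of the variation in the original genes, which is what makes the low-dimensional views discussed later in this chapter informative.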
PCA is not a visual substitute for hypothesis testing. It may be useful for viewing patterns in gene
expression (grouping or separation), but no probability statements can be attached to PCA. In
many circumstances, PCA may be used after hypothesis testing has been used to identify a
smaller group of candidate genes.
The technical details of PCA are found in Chapter 425, Principal Components Analysis, of the
NCSS manuals. That chapter gives mathematical details of the procedure as well as discussion of
the number of factors and the types of rotation.
An excellent discussion of PCA in microarray studies is found in Draghici (2003).
Analysis Steps
Following are the recommended steps for running a PCA on microarray data.
Step 1 – Pre-Processing
Run the appropriate pre-processing procedure (e.g., GenePix Pre-processing or Affymetrix Pre-
processing) to prepare data (.ges) files for statistical analysis. The .ges files are created when a
variable name is entered in the Output File Names Variable box on the variables tab of the pre-
processing procedure window.
Procedure Options
This section describes the options available in this procedure.
Variables Tab
These options specify the variables that will be used in the analysis.
Variables
GES Files Variable
Specify the variable (column) containing the list of .ges files that are to be analyzed. This is the
only variable needed for this analysis.
Genes to be Analyzed
Enter a list of gene names on which the principal components analysis is to be done. The genes
may be entered directly, or the * character may be used to specify all genes with a particular
beginning. A list of genes may be entered as a file, using the notation file(...\...\filename.txt). The
file must contain a list of gene names or IDs, each on a separate line. A list of genes may also be
entered as a column of the spreadsheet, using the notation var(variable name).
EXAMPLES:
Blank
spike1
spike3
spike5
spike7
spike9
AA44719
NM_00582
NM_04762
NM_27564
var(OutputGenes) (all names in the spreadsheet variable with the variable name OutputGenes)
PCA Options
These options are used to specify the details of the principal components analysis.
Factor Rotation
Specifies the type of rotation, if any, that should be applied to the solution. If rotation is desired,
either varimax or quartimax rotation is available.
• None
No rotation is done. This is the suggested option for PCA.
• Varimax
Varimax rotation is the most popular orthogonal rotation technique. The axes are rotated to
maximize the sum of the variances of the squared loadings within each column of the
loadings matrix. This forces the loadings to be either large or small. The hope is that by
rotating the factors, you will obtain new factors that are each highly correlated with only a
few of the original variables, which simplifies the interpretation of the factor.
Another way of stating the goal of varimax rotation is that it clusters the variables into
groups; each 'group' is actually a new factor.
• Quartimax
Quartimax rotation is similar to varimax rotation except that the rows of G are maximized
rather than the columns of G. This rotation is more likely to produce a 'general' factor than
will varimax. Often, the results are quite similar.
Matrix Type
This option indicates whether the analysis is to be run on a correlation or covariance matrix.
Normally, the analysis is run on the scale-invariant correlation matrix since the scale of the
variables changes the analysis when the covariance matrix is used. For example, when a
covariance matrix is used, a variable that is measured in yards results in a different analysis than
if it were measured in feet.
Number of Factors Method
This option specifies which of the following three methods is used to set the number of factors
(principal components) output from the analysis.
• Percent of Eigenvalues
Specify the total percent of variation that must be accounted for. Enough factors (principal
components) will be included to account for this percentage (or slightly greater) of the
variation in the data.
• Number of Factors
Specify the number of factors (principal components) to keep.
• Eigenvalue Cutoff
Specify the minimum eigenvalue amount. All factors whose eigenvalues are greater than or
equal to this value will be retained. Older statistical texts suggest that you should only keep
factors whose eigenvalues are greater than one.
Percent of Eigenvalues
Specify the percent of variation required to be accounted for by the principal components.
Enough factors (principal components) will be included to account for this percentage (or slightly
greater) of the variation in the data.
Recommended Percent: 80.
Number of Factors
Specify the factors (principal components) to be retained in the analysis.
Recommended Number of Factors: 2-5.
Eigenvalue Cutoff
Specify the minimum eigenvalue amount required to retain the factor (principal component). All
factors whose eigenvalues are greater than or equal to this value will be retained.
Recommended Eigenvalue Cutoff: Older statistical texts suggest that you should only keep
factors whose eigenvalues are greater than one.
Reports Tab
The options on this panel control which reports and plots are generated.
Select Reports
The following options are used to determine the reports that will be displayed.
Descriptive Statistics … Score Report
Check the boxes to obtain the desired reports.
Report Options
These options determine the format of the reports.
Precision
Specifies whether unformatted numbers are displayed as single (7-digit) or double (13-digit)
precision numbers.
• Single
Unformatted numbers are displayed with 7-digits. This is the default setting. All reports have
been formatted for single precision.
• Double
Unformatted numbers are displayed with 13 digits. This option is most often used when extremely accurate results are needed for further calculation. Double precision numbers will require more space than allotted, potentially resulting in unaligned output. This option is provided for those instances when accuracy is more important than format alignment.
COMMENTS:
This option does not affect formatted numbers such as probability levels.
This option only influences the format of the numbers as they are output. All calculations are
performed in double precision regardless of selection.
Variable Names
Specify whether to use variable names, variable labels, or both to label output reports.
• Names
Variable Names are the column headings that appear on the database. They may be modified
by selecting the Variable Info tab at the bottom of the spreadsheet or by clicking the right
mouse button while the mouse is pointing to the column heading.
• Labels
This refers to the optional labels that may be specified for each variable. Pressing the
Variable Info tab at the bottom of the spreadsheet window allows you to enter them.
COMMENTS:
1. Most reports are formatted to receive about 12 characters for variable names.
2. Variable Names cannot contain blanks or math symbols (like + - * / . ,), but variable
labels can.
Alpha
The alpha value that is used in the residual reports to test if the observation is an outlier.
RECOMMENDED: .05 or perhaps .10.
RANGE: .001 to .400.
Minimum Loading
Specifies the minimum absolute value that a loading can have and still remain in the Variable List report. Variables whose |loading| is less than this amount are omitted from the Variable List report.
RECOMMENDED: .4
RANGE: Since the loading is a correlation, its absolute value can range between 0 and 1.
Storage Tab
Storage Variable(s)
Factor Scores Storage Variable(s)
You can automatically store the factor scores for each row into the variables specified here. These
scores are generated for each row of data in which all independent variable values are
nonmissing.
WARNING: Existing data in these variables will be replaced with the new values automatically
when the procedure is run.
Template Tab
The options on this panel allow various sets of options to be loaded (File menu: Load Template)
or stored (File menu: Save Template). A template file contains all the settings for this procedure.
Step 1 – Pre-Processing
The 10 arrays used in the example have already been pre-processed using one of the pre-
processing procedures.
This report lets us compare the relative sizes of the standard deviations. In this data set, they are
all about the same size, so either the correlation or the covariance matrix would be appropriate.
The correlation matrix was used in this example.
Count, Mean, and Standard Deviation
These are the familiar summary statistics of each gene. They are displayed to assure that the correct
genes have been specified.
Communality
The communality shows how well this gene is predicted by the retained factors (principal
components). It is the R2 that would be obtained if the values for this gene were regressed on the
factors that were kept. In this example, three factors were kept, and the communality is very high
for all genes.
Correlation Section
Correlation Section
Gene
Gene 31962_at 37001_at 37029_at 37725_at 38730_at
31962_at 1.000000 0.884614 -0.884496 -0.951134 -0.951301
37001_at 0.884614 1.000000 -0.963243 -0.941558 -0.966382
37029_at -0.884496 -0.963243 1.000000 0.956572 0.979432
37725_at -0.951134 -0.941558 0.956572 1.000000 0.979053
38730_at -0.951301 -0.966382 0.979432 0.979053 1.000000
39425_at 0.944943 0.951462 -0.962088 -0.977044 -0.989917
40515_at 0.941978 0.954740 -0.969694 -0.991261 -0.984643
100084_at -0.912223 -0.953814 0.977394 0.974738 0.981873
101482_at -0.935294 -0.904211 0.930076 0.926151 0.961415
94766_at 0.915796 0.965694 -0.978651 -0.974056 -0.990241
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000
Gene
Gene 39425_at 40515_at 100084_at 101482_at 94766_at
31962_at 0.944943 0.941978 -0.912223 -0.935294 0.915796
37001_at 0.951462 0.954740 -0.953814 -0.904211 0.965694
37029_at -0.962088 -0.969694 0.977394 0.930076 -0.978651
37725_at -0.977044 -0.991261 0.974738 0.926151 -0.974056
38730_at -0.989917 -0.984643 0.981873 0.961415 -0.990241
39425_at 1.000000 0.988392 -0.991480 -0.963947 0.990537
40515_at 0.988392 1.000000 -0.989979 -0.935585 0.983204
100084_at -0.991480 -0.989979 1.000000 0.946251 -0.987417
101482_at -0.963947 -0.935585 0.946251 1.000000 -0.955843
94766_at 0.990537 0.983204 -0.987417 -0.955843 1.000000
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000
Gene
Gene 39425_at 40515_at 100084_at 101482_at 94766_at
31962_at ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| |||||||||||||||||||
37001_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
37029_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
37725_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
38730_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
39425_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
40515_at |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
100084_at |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
101482_at |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
94766_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000
The report gives the correlations, along with a test of the overall correlation structure in the data. In this
example, we notice very high correlation values. The Gleason-Staelin redundancy measure, phi,
is 0.959, which is quite large. There is apparently some correlation structure in this data set that
can be modeled. If all the correlations were small, there would be no need for a PCA.
Correlations
The simple correlations between each pair of variables.
Phi
This is the Gleason-Staelin redundancy measure of how interrelated the variables are. A zero value of ϕ means that there is no correlation among the variables (genes), while a value of one indicates perfect correlation among the variables. This coefficient may have a value less than 0.5 even when there is obvious structure in the data, so care should be taken when using it. This statistic is especially useful for comparing two or more sets of data. The formula for computing ϕ is:
$$\varphi = \sqrt{\frac{\sum_{i=1}^{p}\sum_{j=1}^{p} r_{ij}^{2} \; - \; p}{p(p-1)}}$$
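As a check on the definition, ϕ can be computed directly from a correlation matrix in a few lines (a sketch, not GESS code):

```python
import numpy as np

# Gleason-Staelin redundancy measure for a p x p correlation matrix R.
def phi(R):
    p = R.shape[0]
    return np.sqrt((np.sum(R**2) - p) / (p * (p - 1)))

# No correlation (identity matrix) gives 0; perfect correlation gives 1.
print(phi(np.eye(3)))        # 0.0
print(phi(np.ones((3, 3))))  # 1.0
```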
Log(Det|R|)
This is the log (base e) of the determinant of the correlation matrix. If the covariance matrix is
used, this is the log (base e) of the determinant of the covariance matrix.
Bartlett Test, DF, Prob
This is Bartlett’s sphericity test (Bartlett, 1950) for testing the null hypothesis that the correlation
matrix is an identity matrix (all correlations are zero). If a probability (Prob) value greater than
0.05 is obtained, PCA may not be useful. The test is valid for large samples (N>150). It uses a
Chi-square distribution with p(p-1)/2 degrees of freedom. This test is only available when the
correlation matrix is chosen. The formula for computing this test is:
$$\chi^2 = \frac{(11 + 2p - 6N)}{6}\,\log_e \lvert R \rvert$$
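The statistic can be sketched as follows (illustrative code; the chi-square probability itself would come from a statistical library, so this function returns only the statistic and its degrees of freedom):

```python
import numpy as np

# Bartlett's sphericity test statistic for a p x p correlation matrix R
# computed from N observations, following the formula above.
def bartlett_sphericity(R, N):
    p = R.shape[0]
    sign, logdet = np.linalg.slogdet(R)    # log (base e) of det|R|
    chi2 = (11 + 2 * p - 6 * N) / 6.0 * logdet
    df = p * (p - 1) // 2
    return chi2, df

# For an identity matrix, log|R| = 0, so the statistic is 0.
chi2, df = bartlett_sphericity(np.eye(5), N=200)
print(chi2, df)
```

Because log|R| is never positive and the leading coefficient is negative for any realistic N, the statistic is non-negative, growing as the correlations move away from zero.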
Bar Chart of Absolute Correlation Section
This chart graphically displays the absolute values of the correlations. It lets you quickly find
high and low correlations.
Eigenvalues
Eigenvalues
Individual Cumulative
No. Eigenvalue Percent Percent Scree Plot
1 9.630263 96.30 96.30 ||||||||||||||||||||
2 0.165738 1.66 97.96 |
3 0.088083 0.88 98.84 |
4 0.058978 0.59 99.43 |
5 0.029022 0.29 99.72 |
6 0.013452 0.13 99.86 |
7 0.010674 0.11 99.96 |
8 0.003479 0.03 100.00 |
9 0.000312 0.00 100.00 |
10 0.000000 0.00 100.00
Eigenvalue
The eigenvalues are often used to determine how many factors to retain. (In this example, only
the first eigenvalue would be retained.)
When the PCA is run on the correlations, one rule-of-thumb is to retain those factors whose
eigenvalues are greater than one. The sum of the eigenvalues is equal to the number of variables.
Hence, in this example, the first factor retains the information contained in 9.63 of the original
genes.
When the PCA is run on the covariances, the sum of the eigenvalues is equal to the sum of the
variances of the variables.
Individual and Cumulative Percents
The first column gives the percentage of the total variation in the variables accounted for by this
factor. The second column is the cumulative total of the percentage. Some authors suggest that
the user pick a cumulative percentage, such as 80% or 90%, and keep enough factors to attain this
percentage.
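These retention rules are easy to apply mechanically; the sketch below uses a hypothetical eigenvalue list in the style of the table above:

```python
import numpy as np

eigvals = np.array([9.63, 0.17, 0.09, 0.06, 0.03, 0.01, 0.01])

# Rule 1: keep enough factors to reach a cumulative percentage (e.g. 80%).
cum_pct = 100 * np.cumsum(eigvals) / eigvals.sum()
k_percent = int(np.searchsorted(cum_pct, 80)) + 1

# Rule 2 (correlation matrix only): keep factors with eigenvalue > 1.
k_unity = int(np.sum(eigvals > 1.0))

print(k_percent, k_unity)   # both rules keep a single factor here
```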
Scree Plot
This is a rough bar plot of the eigenvalues. It enables one to quickly note the relative size of each
eigenvalue. Many authors recommend it as a method of determining how many factors to retain.
The word scree, first used by Cattell (1966), is usually defined as the rubble at the bottom of a
cliff. When using the scree plot, one must determine which eigenvalues form the “cliff” and
which form the “rubble.” Only the factors that make up the cliff should be kept. Cattell and
Jaspers (1967) suggest keeping those that make up the cliff plus the first factor of the rubble.
Eigenvectors
Eigenvectors
Factors
Gene Factor1 Factor2 Factor3
31962_at -0.306025 -0.713111 -0.286313
37001_at -0.311531 0.410212 -0.125888
37029_at 0.315369 -0.381861 -0.117416
37725_at 0.317630 0.080305 0.447482
38730_at 0.321315 0.009693 -0.024635
39425_at -0.320529 -0.055078 0.073735
40515_at -0.319874 0.030080 -0.289626
100084_at 0.319100 -0.172377 -0.040621
101482_at 0.310574 0.329642 -0.760023
94766_at -0.319949 0.168886 0.137616
Eigenvector
The eigenvectors are the weights that relate the scaled original variables (genes), $x_i = (X_i - \mathrm{Mean}_i)/\sigma_i$, to the factors. For example, the first factor, Factor1, is a contrast of the weighted average of Group A versus the weighted average of Group B, with the weight of each variable given by the corresponding element of the first eigenvector. Mathematically, the relationship is given by:

$$\mathrm{Factor}_1 = v_{11} x_1 + v_{12} x_2 + \cdots + v_{1p} x_p$$
These coefficients may be used to determine the relative importance of each variable in forming
the factor. Often, the eigenvectors are scaled so that the variances of the factor scores are equal to
one. These scaled eigenvectors are given in the Score Coefficients section described later.
Factor Loadings
Factor Loadings
Factors
Gene Factor1 Factor2 Factor3
31962_at -0.949677 -0.290314 -0.084974
37001_at -0.966765 0.167001 -0.037362
37029_at 0.978675 -0.155459 -0.034848
37725_at 0.985690 0.032693 0.132807
38730_at 0.997126 0.003946 -0.007311
39425_at -0.994688 -0.022423 0.021884
40515_at -0.992656 0.012246 -0.085957
100084_at 0.990251 -0.070176 -0.012056
101482_at 0.963793 0.134200 -0.225565
94766_at -0.992888 0.068755 0.040843
Factor Loadings
These are the correlations between the variables (genes) and factors.
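The relationship between eigenvectors, loadings, and correlations can be verified numerically (a sketch on hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))              # 50 rows, 4 hypothetical genes
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: eigenvectors scaled by the square roots of the eigenvalues.
loadings = eigvecs * np.sqrt(eigvals)
scores = Z @ eigvecs

# Each loading equals the correlation between a gene and a factor.
r = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
print(np.isclose(r, loadings[0, 0]))   # True
```

When all p factors are kept, the squared loadings in each row sum to one, which is why the communalities in the next report are bounded by one.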
Bar Chart of Absolute Factor Loadings
This chart graphically displays the absolute values of the factor loadings. It allows the viewer to
quickly interpret the correlation structure. By looking at which variables (genes) correlate highly with a factor, the underlying structure it might represent can be determined.
Interpretation of the Example
We now go through the interpretation of each factor. Factor one appears to be the contrast of the
genes of the two groups (as expected). Although Factor2 is probably not very useful, it may be
interpreted as a contrast of 31962_at + 37029_at versus 37001_at + 101482_at. Factor3 has heavy
weight with 101482_at. If the functions of the genes were known, the viewer could try to attach
meaning to these patterns.
Communalities
Communalities
Factors
Gene Factor1 Factor2 Factor3 Communality
31962_at 0.901887 0.084282 0.007221 0.993389
37001_at 0.934635 0.027889 0.001396 0.963920
37029_at 0.957804 0.024168 0.001214 0.983186
37725_at 0.971585 0.001069 0.017638 0.990291
38730_at 0.994259 0.000016 0.000053 0.994328
39425_at 0.989405 0.000503 0.000479 0.990387
40515_at 0.985365 0.000150 0.007389 0.992904
100084_at 0.980598 0.004925 0.000145 0.985668
101482_at 0.928897 0.018010 0.050880 0.997787
94766_at 0.985827 0.004727 0.001668 0.992223
Communality
The communality is the proportion of the variation of a variable that is accounted for by the
factors that are retained. It is the R² value that would be achieved if this variable were regressed
on the retained factors. The table gives the amount that each factor adds to the communality.
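For example, with a hypothetical three-factor loading matrix in the style of the table above, the communalities are the row sums of squared loadings (a sketch):

```python
import numpy as np

# Hypothetical loadings for three genes on three retained factors.
loadings = np.array([
    [-0.95, -0.29, -0.08],
    [-0.97,  0.17, -0.04],
    [ 0.98, -0.16, -0.03],
])

# Communality of each gene: sum of its squared loadings.
communality = np.sum(loadings**2, axis=1)
print(np.round(communality, 4))
```

A communality near one means the retained factors reproduce that gene almost completely.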
Bar Chart of Communalities
This chart graphically displays the values of the communalities.
Interpretation
This report is provided to summarize the factor structure. Variables with an absolute loading
greater than the amount set in the Minimum Loading option are listed under each factor. Using
this report, you can quickly see which variables are related to each factor. Notice that it is
possible for a variable to have high loadings on several factors. In this example, all the important
loadings are associated with Factor1.
Score Coefficients
Score Coefficients
Factors
Gene Factor1 Factor2 Factor3
31962_at -9.861384E-02 -1.751647 -0.9647075
37001_at -0.1003882 1.007622 -0.4241694
37029_at 0.1016249 -0.9379836 -0.3956228
37725_at 0.1023534 0.1972568 1.507755
38730_at 0.1035408 2.380825E-02 -8.300588E-02
39425_at -0.1032878 -0.1352902 0.2484457
40515_at -0.1030767 7.388766E-02 -0.9758725
100084_at 0.102827 -0.4234185 -0.1368682
101482_at 0.1000796 0.8097138 -2.560834
94766_at -0.1031009 0.4148414 0.4636864
Score Coefficients
These are the coefficients that are used to form the factor scores. The factor scores are the values
of the factors for a particular row of data. These score coefficients are similar to the eigenvectors.
They have been scaled so that the scores produced have a variance of one rather than a variance
equal to the eigenvalue. This causes each of the factors to have the same variance.
These scores would be used to calculate the factor scores for new rows not included in the
original analysis.
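The scaling described above can be sketched as follows (hypothetical data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Score coefficients: eigenvectors divided by the square roots of the
# eigenvalues, so every factor score has variance one.
score_coef = eigvecs / np.sqrt(eigvals)
scores = Z @ score_coef

print(np.round(scores.var(axis=0, ddof=1), 6))   # all 1.0
```

Applying `score_coef` to a new row of data (standardized with the original means and standard deviations) gives the factor scores for that row.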
Residual Section
Residual Section
Row T2 T2 Prob Q0 Q1 Q2 Q3
1 8.10 1.0000 10.26 0.39 0.20 0.19
2 8.10 1.0000 8.83 0.12 0.10 0.08
3 8.10 1.0000 8.00 0.30 0.30 0.14
4 8.10 1.0000 9.19 0.33 0.27 0.06
5 8.10 1.0000 8.53 0.40 0.14 0.14
6 8.10 1.0000 8.17 0.88 0.12 0.12
7 8.10 1.0000 7.62 0.24 0.18 0.05
8 8.10 1.0000 10.16 0.11 0.09 0.09
9 8.10 1.0000 10.60 0.34 0.33 0.11
10 8.10 1.0000 8.64 0.21 0.11 0.07
This report is useful for detecting outliers – observations that are very different from the bulk of
the data. To do this, two quantities are displayed: T² and Qk. We will now define these two
quantities.
T² measures the combined variability of all the variables in a single observation. Mathematically, T² is defined as:

$$T^2 = \left( \mathbf{x} - \bar{\mathbf{x}} \right)' \mathbf{S}^{-1} \left( \mathbf{x} - \bar{\mathbf{x}} \right)$$

where $\mathbf{x}$ represents a p-variable observation, $\bar{\mathbf{x}}$ represents the p-variable mean vector, and $\mathbf{S}^{-1}$ represents the inverse of the covariance matrix.
T² is not affected by a change in scale. It is the same whether the analysis is performed on the covariance or the correlation matrix. T² gives a scaled distance measure of an individual observation from the overall mean. The closer an observation is to its mean, the smaller the value of T².
If the variables follow a multivariate normal distribution, then the probability distribution of T²
may be related to the common F distribution using the formula:
$$T^2_{p,n,\alpha} = \frac{p(n-1)}{n-p} \, F_{p,\,n-p,\,\alpha}$$
Using this relationship, we can perform a statistical test at a given level of significance to
determine if the observation is significantly different from the vector of means. You set the α
value using the Alpha option. Since this test is performed N times, you would anticipate about Nα observations to be significant by chance variation alone. Rows that are starred are significant at the chosen significance level; such rows should be checked for data entry or transcription errors. (In this example, no rows are starred.)
T² is really not part of a normal PCA since it may be calculated independently. It is presented to
help detect observations that may have an undue influence on the analysis. You can read more
about its use and interpretation in Jackson (1991).
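T² is straightforward to compute directly, as the following sketch shows (hypothetical data rather than the example's .ges files):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))      # 30 observations on 4 variables

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diffs = X - xbar

# T-squared for every row: (x - xbar)' S^{-1} (x - xbar).
T2 = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

# A known identity: the T-squared values sum to (n - 1) * p.
print(round(T2.sum(), 6))   # 116.0
```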
The other quantity shown on this report is Qk. Qk represents the sum of squared residuals when an observation is predicted using the first k factors. Mathematically, the formula for Qk is:

$$Q_k = \left( \mathbf{x} - \hat{\mathbf{x}} \right)' \left( \mathbf{x} - \hat{\mathbf{x}} \right) = \sum_{i=1}^{p} \left( x_i - {}_k\hat{x}_i \right)^2 = \sum_{i=k+1}^{p} \lambda_i \left( pc_i \right)^2$$

Here ${}_k\hat{x}_i$ refers to the value of variable i predicted from the first k factors, $\lambda_i$ refers to the ith eigenvalue, and $pc_i$ is the score of the ith factor for this particular observation. Further details are given in Jackson (1991) on pages 36 and 37.
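Computing Qk amounts to reconstructing each standardized observation from the first k factors and summing the squared residuals (a sketch, correlation-matrix case):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 6))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Predict each row from the first k factors, then sum squared residuals.
k = 2
Z_hat = Z @ eigvecs[:, :k] @ eigvecs[:, :k].T
Qk = np.sum((Z - Z_hat) ** 2, axis=1)

print(Qk.shape)   # one Q value per row
```

Rows with large Qk are poorly reproduced by the first k factors; with k = p the reconstruction is exact and every Qk is zero.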
An upper limit for Qk is given by the formula:

$$Q_\alpha = a \left[ \frac{z_\alpha \sqrt{2b}\, h}{a} + \frac{b h (h-1)}{a^2} + 1 \right]^{1/h}$$

where

$$a = \sum_{i=k+1}^{p} \lambda_i, \qquad b = \sum_{i=k+1}^{p} \lambda_i^2, \qquad c = \sum_{i=k+1}^{p} \lambda_i^3, \qquad h = 1 - \frac{2ac}{3b^2}$$

and
zα is the upper normal deviate of area α if h is positive or the lower normal deviate of area α if h
is negative.
This limit is valid for any value of k, whether too many or too few factors are kept. Note that
these formulas are for the case when the correlation matrix is being used. When the analysis is
being run on the covariance matrix, the pci’s must be adjusted. Further details are given in
Jackson (1991).
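The limit itself follows directly from the formula (a sketch; z_alpha is supplied as a constant, e.g. 1.645 for α = 0.05, rather than computed from a normal distribution):

```python
import numpy as np

# Upper limit for Q_k from the eigenvalues beyond the k retained factors.
def q_limit(eigvals, k, z_alpha=1.645):
    lam = np.sort(eigvals)[::-1][k:]   # lambda_{k+1} .. lambda_p
    a = lam.sum()
    b = (lam**2).sum()
    c = (lam**3).sum()
    h = 1 - 2 * a * c / (3 * b**2)
    term = z_alpha * np.sqrt(2 * b) * h / a + b * h * (h - 1) / a**2 + 1
    return a * term ** (1 / h)

# Eigenvalues in the style of the example, keeping k = 1 factor.
q = q_limit(np.array([9.63, 0.17, 0.09, 0.06, 0.05]), k=1)
print(q > 0)   # a positive control limit
```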
Notice that significant (starred) values of Qk indicate observations that are not duplicated well by
the first k factors. These should be checked to see if they are valid. Qk and T² provide an initial
data screening tool.
Interpretation of the Example
None of the rows shows any sign of being an outlier. The T2 Probs are all 1.
Factor Score
Factor Score
Factors
Row Factor1 Factor2 Factor3
1 -1.0120 1.0718 0.3154
2 -0.9514 -0.2871 -0.4970
3 -0.8938 -0.1370 1.3564
4 -0.9587 0.6131 -1.5454
5 -0.9190 -1.2364 0.2941
6 0.8705 -2.1380 -0.1182
7 0.8756 0.6267 1.2020
8 1.0215 0.3782 -0.0955
9 1.0323 0.3053 -1.5639
10 0.9351 0.8035 0.6521
This report presents the individual factor scores scaled so each column has a mean of zero and a
standard deviation of one. These are the values that are plotted in the plots that follow. There is
one row of score values for each observation and one column for each factor that was kept.
Factor Scores

[Scatter plots of Score1 versus Score2, Score1 versus Score3, and Score2 versus Score3 appear here, with each point labeled by its row number.]
This set of plots shows each factor plotted against every other factor. The first k factors (where k is the number of large eigenvalues) usually show the major structure that will be found in the data. The rest of the factors show outliers and linear dependencies. The numbers correspond to the row numbers of the individuals.
Factor Loadings

[Scatter plots of Loading1 versus Loading2, Loading1 versus Loading3, and Loading2 versus Loading3 appear here, with each point labeled by its gene number.]
This set of plots shows each gene's loadings on pairs of factors (values shown in the Score Coefficients section). The plot allows you to find genes that are highly correlated with both factors. It is anticipated that this will aid in the interpretation of the factors.
In this example, genes 3 (37029_at), 4 (37725_at), 5 (38730_at), 8 (100084_at), and 9
(101482_at) have means that are large in Group A and small in Group B. Genes 1 (31962_at), 2
(37001_at), 6 (39425_at), 7 (40515_at), and 10 (94766_at) have means that are large in Group B
and small in Group A.