Chapter 250

Principal Components Analysis

Introduction
This chapter describes how to obtain a principal components analysis (PCA) for a subset list of
genes using the GESS: Principal Components Analysis procedure. Either a correlation or a
variance-covariance matrix may be used. Options are included for varimax factor rotation,
quartimax factor rotation, or no factor rotation.
Before running this procedure, output (.ges) files containing a single expression value for each
gene on each array must be obtained using the appropriate pre-processing procedure in GESS.

Background
The general purpose of PCA is reduction of dimensionality. In microarray studies, the expression
of a large number of genes is obtained for several individuals or experimental units. Each gene
may be thought of as a factor, variable, or component of the individuals. If only two genes were
measured, a simple scatter plot of the values of each individual would show the relationship of
these two factors (genes). However, when larger numbers of genes are measured, visualization of
the resulting multidimensional space is impossible. Principal components analysis is a method for
reducing the multidimensional space down to only a few dimensions.
Principal components are produced by creating a new coordinate system or space for the factors
(genes). A new set of uncorrelated factors is produced. These factors are ordered so that the first
few retain most of the variation present in all of the original gene factors.
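The construction described above can be sketched in a few lines of NumPy. This is only an illustration on simulated data, not the GESS implementation: the function name `pca_correlation` and the random input are assumptions made for the example. It shows that the new factors are uncorrelated and ordered by the amount of variation they retain.

```python
import numpy as np

def pca_correlation(X):
    """PCA of an (individuals x genes) array using the correlation matrix."""
    # Standardize each gene (column) to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order, so reorder to descending.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Factor scores: coordinates of each individual in the new system.
    scores = Z @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Simulated data: 30 individuals, 4 genes, two of them strongly related.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
X[:, 1] += 2.0 * X[:, 0]

eigenvalues, eigenvectors, scores = pca_correlation(X)
score_cov = np.cov(scores, rowvar=False)  # diagonal, with the eigenvalues on it
```

The covariance matrix of the scores is diagonal (the factors are uncorrelated), and its diagonal holds the eigenvalues in decreasing order.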
PCA is not a visual substitute for hypothesis testing. It may be useful for viewing patterns in gene
expression (grouping or separation), but no probability statements can be attached to PCA. In
many circumstances, PCA may be used after hypothesis testing has been used to identify a
smaller group of candidate genes.
The technical details of PCA are found in Chapter 425, Principal Components Analysis, of the
NCSS manuals. That chapter gives mathematical details of the procedure as well as discussion of
the number of factors and the types of rotation.
An excellent discussion of PCA in microarray studies is found in Draghici (2003).

Analysis Steps
Following are the recommended steps for running a PCA on microarray data.

Step 1 – Pre-Processing
Run the appropriate pre-processing procedure (e.g., GenePix Pre-processing or Affymetrix Pre-
processing) to prepare data (.ges) files for statistical analysis. The .ges files are created when a
variable name is entered in the Output File Names Variable box on the variables tab of the pre-
processing procedure window.

Step 2 – Obtain a List of Genes


A list of the genes to be analyzed is required for running Principal Components Analysis in
GESS. Often this list will be obtained by running one of the hypothesis testing procedures. When
running the hypothesis testing procedures, lists may be output into the spreadsheet, or obtained
from the output word processing window itself. The gene list may be typed individually or saved
as a file of names.

Step 3 – Spreadsheet Setup


Only a single column of pre-processed (.ges) data files is required to run GESS: Principal
Components Analysis.

Step 4 – Run the Analysis


Carefully select the number of factors method, the factor rotation method, and the matrix type
(for discussion of these, see Chapter 425, Principal Components Analysis, of the NCSS manuals).
The selected reports should include the descriptive statistics and the eigenvalue summary.

Procedure Options
This section describes the options available in this procedure.

Variables Tab
These options specify the variables that will be used in the analysis.

Variables
GES Files Variable
Specify the variable (column) containing the list of .ges files that are to be analyzed. This is the
only variable needed for this analysis.

Genes to be Analyzed
Enter a list of gene names on which the principal components analysis is to be done. The genes
may be entered directly, or the * character may be used to specify all genes with a particular
beginning. A list of genes may be entered as a file, using the notation file(...\...\filename.txt). The
file must contain a list of gene names or IDs, each on a separate line. A list of genes may also be
entered as a column of the spreadsheet, using the notation var(variable name).
EXAMPLES:
Blank

spike1
spike3
spike5
spike7
spike9

spike* (all names beginning with spike)

AA44719

NM_00582
NM_04762
NM_27564

cntrl* (all names beginning with cntrl)

file(C:\Microarray\genelist.txt) (all names in the genelist.txt file)

var(OutputGenes) (all names in the spreadsheet variable with the variable name OutputGenes)
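
As a rough illustration of how literal names and the * wildcard select genes, the following sketch uses Python's standard `fnmatch` module. The function `select_genes` and the sample gene list are hypothetical, the matching is assumed to be case-sensitive, and the file(...) and var(...) notations are not handled. It is not how GESS itself resolves the list.

```python
from fnmatch import fnmatchcase

def select_genes(patterns, available_genes):
    """Return the available genes that match any literal name or prefix* pattern."""
    selected = []
    for gene in available_genes:
        # fnmatchcase treats * as "any run of characters" and is case-sensitive.
        if any(fnmatchcase(gene, pattern) for pattern in patterns):
            selected.append(gene)
    return selected

available = ["spike1", "spike3", "cntrl7", "AA44719", "NM_00582"]
chosen = select_genes(["spike*", "AA44719"], available)
```

Note that `fnmatch` also gives special meaning to `?` and `[...]`, so this sketch is only suitable for names that do not contain those characters.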

PCA Options
These options are used to specify the details of the principal components analysis.
Factor Rotation
Specifies the type of rotation, if any, that should be applied to the solution. If rotation is desired,
either varimax or quartimax rotation is available.

• None
No rotation is done. This is the suggested option for PCA.

• Varimax
Varimax rotation is the most popular orthogonal rotation technique. The axes are rotated to
maximize the sum of the variances of the squared loadings within each column of the
loadings matrix. This forces the loadings to be either large or small. The hope is that by
rotating the factors, you will obtain new factors that are each highly correlated with only a
few of the original variables, which simplifies the interpretation of the factor.
Another way of stating the goal of varimax rotation is that it clusters the variables into
groups; each 'group' is actually a new factor.

• Quartimax
Quartimax rotation is similar to varimax rotation except that the rows of the loadings matrix,
G, are maximized rather than the columns of G. This rotation is more likely to produce a
'general' factor than will varimax. Often, the results are quite similar.
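
For readers who want to see what an orthogonal rotation does to a loadings matrix, here is a minimal sketch of the widely used iterative SVD algorithm for the orthomax family, where `gamma=1.0` corresponds to the varimax criterion and `gamma=0.0` to quartimax. The function name `orthomax`, the stopping rule, and the small loadings matrix are assumptions for illustration. NCSS may use a different algorithm or normalization, so results need not match its output.

```python
import numpy as np

def orthomax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loadings matrix.

    gamma=1.0 uses the varimax criterion, gamma=0.0 the quartimax criterion.
    """
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the orthomax criterion.
        column_ss = np.diag(np.sum(rotated**2, axis=0))
        target = rotated**3 - (gamma / p) * (rotated @ column_ss)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # nearest orthogonal matrix to the target
        current = s.sum()
        if previous != 0.0 and current < previous * (1.0 + tol):
            break
        previous = current
    return loadings @ rotation, rotation

# Hypothetical loadings for four variables on two factors.
loadings = np.array([[0.7, 0.5], [0.8, 0.4], [0.5, -0.6], [0.4, -0.7]])
rotated, rotation = orthomax(loadings, gamma=1.0)
```

Because the rotation matrix is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the way it is split among the factors changes.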
Matrix Type
This option indicates whether the analysis is to be run on a correlation or covariance matrix.
Normally, the analysis is run on the scale-invariant correlation matrix since the scale of the
variables changes the analysis when the covariance matrix is used. For example, when a
covariance matrix is used, a variable that is measured in yards results in a different analysis than
if it were measured in feet.
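
The yards-versus-feet remark can be checked numerically. The sketch below uses simulated data (the variable names and values are illustrative assumptions) and compares the eigenvalues obtained from each matrix type before and after one variable is rescaled.

```python
import numpy as np

rng = np.random.default_rng(1)
yards = rng.normal(size=(50, 3))
yards[:, 1] += 0.5 * yards[:, 0]

feet = yards.copy()
feet[:, 0] *= 3.0  # the same first variable, expressed in feet instead of yards

def sorted_eigenvalues(matrix):
    """Eigenvalues of a symmetric matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(matrix))[::-1]

corr_yards = sorted_eigenvalues(np.corrcoef(yards, rowvar=False))
corr_feet = sorted_eigenvalues(np.corrcoef(feet, rowvar=False))
cov_yards = sorted_eigenvalues(np.cov(yards, rowvar=False))
cov_feet = sorted_eigenvalues(np.cov(feet, rowvar=False))
```

The correlation-based eigenvalues are identical for both versions of the data, while the covariance-based eigenvalues change with the unit of measurement.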
Number of Factors Method
This option specifies which of the following three methods is used to set the number of factors
(principal components) output from the analysis.

• Percent of Eigenvalues
Specify the total percent of variation that must be accounted for. Enough factors (principal
components) will be included to account for this percentage (or slightly greater) of the
variation in the data.

• Number of Factors
Specify the number of factors (principal components) to keep.

• Eigenvalue Cutoff
Specify the minimum eigenvalue amount. All factors whose eigenvalues are greater than or
equal to this value will be retained. Older statistical texts suggest that you should only keep
factors whose eigenvalues are greater than one.
Percent of Eigenvalues
Specify the percent of variation required to be accounted for by the principal components.
Enough factors (principal components) will be included to account for this percentage (or slightly
greater) of the variation in the data.
Recommended Percent: 80.
Number of Factors
Specify the number of factors (principal components) to be retained in the analysis.
Recommended Number of Factors: 2-5.
Eigenvalue Cutoff
Specify the minimum eigenvalue amount required to retain the factor (principal component). All
factors whose eigenvalues are greater than or equal to this value will be retained.
Recommended Eigenvalue Cutoff: Older statistical texts suggest that you should only keep
factors whose eigenvalues are greater than one.
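
The percent and cutoff rules can be expressed as short functions. The sketch below is an illustration rather than the NCSS code: the function names are assumptions, and the eigenvalues are copied from the Eigenvalues report of Example 1 later in this chapter.

```python
import numpy as np

def n_factors_by_percent(eigenvalues, percent):
    """Smallest number of factors whose cumulative percent reaches `percent`."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    cumulative = 100.0 * np.cumsum(eigenvalues) / eigenvalues.sum()
    # Small tolerance so that percent=100 is reachable despite rounding.
    reached = np.nonzero(cumulative >= percent - 1e-9)[0]
    return int(reached[0]) + 1 if reached.size else len(eigenvalues)

def n_factors_by_cutoff(eigenvalues, cutoff=1.0):
    """Number of factors whose eigenvalue is at least `cutoff`."""
    return int(np.sum(np.asarray(eigenvalues, dtype=float) >= cutoff))

# Eigenvalues from the Example 1 report (correlation matrix, 10 genes).
example_eigenvalues = [9.630263, 0.165738, 0.088083, 0.058978, 0.029022,
                       0.013452, 0.010674, 0.003479, 0.000312, 0.000000]

keep_80_percent = n_factors_by_percent(example_eigenvalues, 80)
keep_98_percent = n_factors_by_percent(example_eigenvalues, 98)
keep_cutoff_one = n_factors_by_cutoff(example_eigenvalues, 1.0)
```

For these eigenvalues, the 80 percent rule and the eigenvalue-greater-than-one rule both keep a single factor, while a 98 percent requirement keeps three.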

Reports Tab
The options on this panel control which reports and plots are generated.

Select Reports
The following options are used to determine the reports that will be displayed.
Descriptive Statistics … Score Report
Check the boxes to obtain the desired reports.

Select Factor Plots


The following options are used to determine the plots that will be displayed.
Scores Plot and Loadings Plot
Check the boxes to obtain the desired factor plots.
Row Numbers
Check this box to obtain numbers near the points on the plot that correspond to the rows
(individuals, experimental units).
Ordered Gene Numbers
Check this box to obtain numbers near the points on the plot that correspond to the alpha-numeric
ordered genes.
Number of Factors Plotted
Specify the number of factors to be plotted.
You can limit the number of plots generated using this parameter.
RECOMMENDED:
Usually, you will only have interest in the first 3 or 4 factors.

Report Options
These options determine the format of the reports.
Precision
Specifies whether unformatted numbers are displayed as single (7-digit) or double (13-digit)
precision numbers.

• Single
Unformatted numbers are displayed with 7-digits. This is the default setting. All reports have
been formatted for single precision.

• Double
Unformatted numbers are displayed with 13-digits. This option is most often used when
extremely accurate results are needed for further calculation. Double precision numbers will
require more space than allotted, potentially resulting in unaligned output. This option is
provided for those instances when accuracy is more important than format alignment.
COMMENTS:
This option does not affect formatted numbers such as probability levels.
This option only influences the format of the numbers as they are output. All calculations are
performed in double precision regardless of selection.
Variable Names
Specify whether to use variable names, variable labels, or both to label output reports.

• Names
Variable Names are the column headings that appear on the database. They may be modified
by selecting the Variable Info tab at the bottom of the spreadsheet or by clicking the right
mouse button while the mouse is pointing to the column heading.

• Labels
This refers to the optional labels that may be specified for each variable. Pressing the
Variable Info tab at the bottom of the spreadsheet window allows you to enter them.
COMMENTS:
1. Most reports are formatted to receive about 12 characters for variable names.
2. Variable Names cannot contain blanks or math symbols (like + - * / . ,), but variable
labels can.
Alpha
The alpha value that is used in the residual reports to test if the observation is an outlier.
RECOMMENDED: .05 or perhaps .10.
RANGE: .001 to .400.
Minimum Loading
Specifies the minimum absolute value that a loading can have and still remain in the Variable List
report. No variables are kept in Variable List Report with |loadings| < this amount.
RECOMMENDED: .4
RANGE: Since the loading is a correlation, its absolute value can range between 0 and 1.

Storage Tab

Storage Variable(s)
Factor Scores Storage Variable(s)
You can automatically store the factor scores for each row into the variables specified here. These
scores are generated for each row of data in which all independent variable values are
nonmissing.
WARNING: Existing data in these variables will be replaced with the new values automatically
when the procedure is run.

Scores Plot Tab and Loadings Plot Tab


These sections specify the pair-wise plots of the scores and loadings.

Vertical and Horizontal Axes


These options are used to format the plot axes.
Label
This is the text of the label. The characters {Y} and {X} are replaced by appropriate names. Press
the button on the right of the field to specify the font of the text.
Minimum and Maximum
These options specify the minimum and maximum values to be displayed on the vertical (Y) and
horizontal (X) axis. If left blank, these values are calculated from the data.
Tick Label Settings…
This option specifies the characteristics of the reference numbers.
It displays a window that edits the font size and color of the reference numbers that appear next to
the text along the axis of the plot.
It also allows you to set the number of digits in the reference numbers as well as their
vertical/horizontal orientation.
Note that in some cases, the format specified here is overridden by the variable's format as
specified on the database in the Variable Info Sheet.
Major Ticks - Minor Ticks
These options set the number of major and minor tickmarks displayed on each axis.
Show Grid Lines
These check boxes indicate whether the grid lines should be displayed.

Scores Plot Settings


These options are used to specify the appearance of the plots.
Style File
Designate a scatter plot style file. This file sets all scatter plot options that are not set directly on
this panel. Unless you choose otherwise, the default style file (Default) is used. These files are
created in the Scatter Plot procedure.
Interior Color
Specify the interior color of the plot.
Background Color
Specify the background color of the plot.
Symbol
Click this box to bring up the symbol specification dialog box. This window will let you set the
symbol type, size, and color.

Scores Plot Title


Title
This is the text of the title. The characters {Y} and {X} are replaced by appropriate names. Press
the button on the right of the field to specify the font of the text.

Template Tab
The options on this panel allow various sets of options to be loaded (File menu: Load Template)
or stored (File menu: Save Template). A template file contains all the settings for this procedure.

Specify the Template File Name


File Name
Designate the name of the template file either to be loaded or stored.

Select a Template to Load or Save


Template Files
A list of previously stored template files for this procedure.
Template Id’s
A list of the Template Id’s of the corresponding files. This id value is loaded in the box at the
bottom of the panel.

Example 1 – Principal Components Analysis – Analysis


Steps
This section presents an example of a principal components analysis for 10 genes on 10
individuals. The 10 genes are selected based on two-sample T-Tests for hundreds of genes.

Step 1 – Pre-Processing
The 10 arrays used in the example have already been pre-processed using one of the pre-
processing procedures.

Step 2 – Obtain a List of Genes


A list of the genes to be analyzed is required for running Principal Components Analysis in
GESS. In this example, this list was obtained by running the GESS: T-Test – Two Groups
procedure. A list of significant genes was output to the spreadsheet into the third column.

Step 3 – Spreadsheet Setup


To open the PCA1 dataset, use the following steps.

1 Open the PCA1 dataset.


• From the File menu of the NCSS Data window, select Open.
• Select the Data subdirectory of your NCSS directory.
• Open the GESS folder.
• Click on the file PCA1.S0.
• Click Open.

The PCA1 dataset should appear as


PCA1 dataset
Group OutputFile C3
A %p%\DATA\GESS\PCA\Ex1\PCA1_1.ges 100084_at
A %p%\DATA\GESS\PCA\Ex1\PCA1_2.ges 101482_at
A %p%\DATA\GESS\PCA\Ex1\PCA1_3.ges 31962_at
A %p%\DATA\GESS\PCA\Ex1\PCA1_4.ges 37001_at
A %p%\DATA\GESS\PCA\Ex1\PCA1_5.ges 37029_at
B %p%\DATA\GESS\PCA\Ex1\PCA1_6.ges 37725_at
B %p%\DATA\GESS\PCA\Ex1\PCA1_7.ges 38730_at
B %p%\DATA\GESS\PCA\Ex1\PCA1_8.ges 39425_at
B %p%\DATA\GESS\PCA\Ex1\PCA1_9.ges 40515_at
B %p%\DATA\GESS\PCA\Ex1\PCA1_10.ges 94766_at

Step 4 – Run the Analysis


The following steps are taken to run the principal components analysis. You may follow along
here by making the appropriate entries or load the completed template Example 1 from the
Template tab of the GESS Principal Components Analysis window.

2 Open the GESS Principal Components Analysis window.


• On the menus, select GESS, then Multivariate Routines, then Principal Components
Analysis. The Principal Components Analysis procedure will be displayed.
• On the menus, select File, then New Template. This will fill the procedure with the
default template.

3 Specify the variables.


• On the GESS: Principal Components Analysis window, select the Variables tab.
• Set the GES Files Variable to OutputFile.
• Under Genes to be Analyzed, enter var(C3).
• Set Number of Factors Method to Number of Factors.
• Set Number of Factors to 3.
• Set Factor Rotation to None.
• Set Matrix Type to Correlation.

4 Specify the reports.


• Select the Reports tab.
• Check all reports and plots. Normally only a few of these reports would be selected, but
all are selected here for documentation.

5 Run the procedure.


• From the Run menu, select Run Procedure. Alternatively, just click the Run button (the
left-most button on the button bar at the top).
• Note: The PCA Machine Zero may need to be changed to 1E-14. This is done by
clicking on Edit, then Options, then the Constants tab.

Descriptive Statistics Section


Descriptive Statistics Section
                              Standard
Gene         Count  Mean      Deviation  Communality
31962_at     10     4.442558  1.437255   0.993389
37001_at     10     4.196865  1.076856   0.963920
37029_at     10     4.268414  1.277641   0.983186
37725_at     10     4.517559  1.499546   0.990291
38730_at     10     4.492593  2.017406   0.994328
39425_at     10     4.377913  1.741861   0.990387
40515_at     10     4.331538  1.685082   0.992904
100084_at    10     4.552278  2.000915   0.985668
101482_at    10     4.410147  1.26448    0.997787
94766_at     10     4.462752  1.511935   0.992223

This report lets us compare the relative sizes of the standard deviations. In this data set, they are
all about the same size, so either the correlation or the covariance matrix would be appropriate.
The correlation matrix was used in this example.
Count, Mean, and Standard Deviation
These are the familiar summary statistics of each gene. They are displayed to assure that the correct
genes have been specified.
Communality
The communality shows how well this gene is predicted by the retained factors (principal
components). It is the R² that would be obtained if the values for this gene were regressed on the
factors that were kept. In this example, three factors were kept, and the communality is very high
for all genes.

Correlation Section
Correlation Section
Gene
Gene 31962_at 37001_at 37029_at 37725_at 38730_at
31962_at 1.000000 0.884614 -0.884496 -0.951134 -0.951301
37001_at 0.884614 1.000000 -0.963243 -0.941558 -0.966382
37029_at -0.884496 -0.963243 1.000000 0.956572 0.979432
37725_at -0.951134 -0.941558 0.956572 1.000000 0.979053
38730_at -0.951301 -0.966382 0.979432 0.979053 1.000000
39425_at 0.944943 0.951462 -0.962088 -0.977044 -0.989917
40515_at 0.941978 0.954740 -0.969694 -0.991261 -0.984643
100084_at -0.912223 -0.953814 0.977394 0.974738 0.981873
101482_at -0.935294 -0.904211 0.930076 0.926151 0.961415
94766_at 0.915796 0.965694 -0.978651 -0.974056 -0.990241
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000

Gene
Gene 39425_at 40515_at 100084_at 101482_at 94766_at
31962_at 0.944943 0.941978 -0.912223 -0.935294 0.915796
37001_at 0.951462 0.954740 -0.953814 -0.904211 0.965694
37029_at -0.962088 -0.969694 0.977394 0.930076 -0.978651
37725_at -0.977044 -0.991261 0.974738 0.926151 -0.974056
38730_at -0.989917 -0.984643 0.981873 0.961415 -0.990241
39425_at 1.000000 0.988392 -0.991480 -0.963947 0.990537
40515_at 0.988392 1.000000 -0.989979 -0.935585 0.983204
100084_at -0.991480 -0.989979 1.000000 0.946251 -0.987417
101482_at -0.963947 -0.935585 0.946251 1.000000 -0.955843
94766_at 0.990537 0.983204 -0.987417 -0.955843 1.000000
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000

Bar Chart of Absolute Correlation Section


Gene
Gene 31962_at 37001_at 37029_at 37725_at 38730_at
31962_at |||||||||||||||||| |||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
37001_at |||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
37029_at |||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
37725_at |||||||||||||||||||| ||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
38730_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
39425_at ||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
40515_at ||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
100084_at ||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
101482_at ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
94766_at ||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000

Gene
Gene 39425_at 40515_at 100084_at 101482_at 94766_at
31962_at ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| |||||||||||||||||||
37001_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
37029_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
37725_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
38730_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
39425_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
40515_at |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
100084_at |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
101482_at |||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||| ||||||||||||||||||||
94766_at |||||||||||||||||||| |||||||||||||||||||| |||||||||||||||||||| ||||||||||||||||||||
Phi=0.959061 Log(Det|R|)=0.000000 Bartlett Test=0.00 DF= Prob=0.000000

The report gives the correlations for a test of the overall correlation structure in the data. In this
example, we notice very high correlation values. The Gleason-Staelin redundancy measure, phi,
is 0.959, which is quite large. There is apparently some correlation structure in this data set that
can be modeled. If all the correlations were small, there would be no need for a PCA.
Correlations
The simple correlations between each pair of variables.
Phi
This is the Gleason-Staelin redundancy measure of how interrelated the variables are. A zero
value of ϕ means that there is no correlation among the variables (genes), while a value of one
indicates perfect correlation among the variables. This coefficient may have a value less than 0.5
even when there is obvious structure in the data, so care should be taken when using it. This
statistic is especially useful for comparing two or more sets of data. The formula for computing
ϕ is:

\[ \phi = \sqrt{\frac{\sum_{i=1}^{p} \sum_{j=1}^{p} r_{ij}^{2} - p}{p(p-1)}} \]
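
A short sketch of this measure follows, assuming the usual Gleason-Staelin form ϕ = sqrt((ΣΣ r²ij − p) / (p(p − 1))), that is, the root of the average squared off-diagonal correlation. The square root is consistent with Example 1, where correlations near ±0.96 give ϕ = 0.959. The function name is an assumption made for the example.

```python
import numpy as np

def gleason_staelin_phi(R):
    """Redundancy measure phi for a p x p correlation matrix R."""
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    # Sum of all squared correlations minus the p ones on the diagonal,
    # averaged over the p(p-1) off-diagonal entries, then square-rooted.
    return float(np.sqrt((np.sum(R**2) - p) / (p * (p - 1))))

phi_uncorrelated = gleason_staelin_phi(np.eye(4))       # no correlation
phi_perfect = gleason_staelin_phi(np.ones((4, 4)))      # perfect correlation
phi_half = gleason_staelin_phi([[1.0, 0.5], [0.5, 1.0]])
```

With only two variables, ϕ reduces to the absolute value of their correlation, which is why the last case returns 0.5.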

Log(Det|R|)
This is the log (base e) of the determinant of the correlation matrix. If the covariance matrix is
used, this is the log (base e) of the determinant of the covariance matrix.
Bartlett Test, DF, Prob
This is Bartlett’s sphericity test (Bartlett, 1950) for testing the null hypothesis that the correlation
matrix is an identity matrix (all correlations are zero). If a probability (Prob) value greater than
0.05 is obtained, PCA may not be useful. The test is valid for large samples (N>150). It uses a
Chi-square distribution with p(p-1)/2 degrees of freedom. This test is only available when the
correlation matrix is chosen. The formula for computing this test is:

\[ \chi^2 = \frac{(11 + 2p - 6N)}{6} \, \log_e |R| \]
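
The statistic χ² = ((11 + 2p − 6N) / 6) · log_e|R| and its degrees of freedom can be computed as in the sketch below. The function name and the small test matrix are illustrative assumptions, and the probability level is omitted to keep the example dependent on NumPy only. A singular correlation matrix (as occurs when there are no more rows than genes) has no finite log-determinant, so the sketch raises an error in that case rather than guessing how the software reports it.

```python
import numpy as np

def bartlett_sphericity(R, n_obs):
    """Bartlett's sphericity chi-square and degrees of freedom.

    R is a p x p correlation matrix and n_obs is the number of rows (N).
    """
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    sign, log_det = np.linalg.slogdet(R)  # natural log of |det(R)|
    if sign <= 0:
        raise ValueError("correlation matrix is singular or not positive definite")
    chi_square = (11 + 2 * p - 6 * n_obs) / 6.0 * log_det
    df = p * (p - 1) // 2
    return chi_square, df

# Two variables with correlation 0.5 measured on 20 rows (made-up values).
chi_square, df = bartlett_sphericity([[1.0, 0.5], [0.5, 1.0]], n_obs=20)
```

An identity correlation matrix has a log-determinant of zero, so the statistic is zero, in line with the null hypothesis being tested.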
Bar Chart of Absolute Correlation Section
This chart graphically displays the absolute values of the correlations. It lets you quickly find
high and low correlations.

Eigenvalues
Eigenvalues
Individual Cumulative
No. Eigenvalue Percent Percent Scree Plot
1 9.630263 96.30 96.30 ||||||||||||||||||||
2 0.165738 1.66 97.96 |
3 0.088083 0.88 98.84 |
4 0.058978 0.59 99.43 |
5 0.029022 0.29 99.72 |
6 0.013452 0.13 99.86 |
7 0.010674 0.11 99.96 |
8 0.003479 0.03 100.00 |
9 0.000312 0.00 100.00 |
10 0.000000 0.00 100.00

Eigenvalue
The eigenvalues are often used to determine how many factors to retain. (In this example, only
the first eigenvalue would be retained.)
When the PCA is run on the correlations, one rule-of-thumb is to retain those factors whose
eigenvalues are greater than one. The sum of the eigenvalues is equal to the number of variables.
Hence, in this example, the first factor retains the information contained in 9.63 of the original
genes.
When the PCA is run on the covariances, the sum of the eigenvalues is equal to the sum of the
variances of the variables.
Individual and Cumulative Percents
The first column gives the percentage of the total variation in the variables accounted for by this
factor. The second column is the cumulative total of the percentage. Some authors suggest that
the user pick a cumulative percentage, such as 80% or 90%, and keep enough factors to attain this
percentage.
Scree Plot
This is a rough bar plot of the eigenvalues. It enables one to quickly note the relative size of each
eigenvalue. Many authors recommend it as a method of determining how many factors to retain.
The word scree, first used by Cattell (1966), is usually defined as the rubble at the bottom of a
cliff. When using the scree plot, one must determine which eigenvalues form the “cliff” and
which form the “rubble.” Only the factors that make up the cliff should be kept. Cattell and
Jaspers (1967) suggest keeping those that make up the cliff plus the first factor of the rubble.

Eigenvectors
Eigenvectors
Factors
Gene Factor1 Factor2 Factor3
31962_at -0.306025 -0.713111 -0.286313
37001_at -0.311531 0.410212 -0.125888
37029_at 0.315369 -0.381861 -0.117416
37725_at 0.317630 0.080305 0.447482
38730_at 0.321315 0.009693 -0.024635
39425_at -0.320529 -0.055078 0.073735
40515_at -0.319874 0.030080 -0.289626
100084_at 0.319100 -0.172377 -0.040621
101482_at 0.310574 0.329642 -0.760023
94766_at -0.319949 0.168886 0.137616

Bar Chart of Absolute Eigenvectors


Factors
Gene Factor1 Factor2 Factor3
31962_at ||||||| ||||||||||||||| ||||||
37001_at ||||||| ||||||||| |||
37029_at ||||||| |||||||| |||
37725_at ||||||| || |||||||||
38730_at ||||||| | |
39425_at ||||||| || ||
40515_at ||||||| | ||||||
100084_at ||||||| |||| |
101482_at ||||||| ||||||| ||||||||||||||||
94766_at ||||||| |||| |||

Eigenvector
The eigenvectors are the weights that relate the scaled original variables (genes),
$x_i = (X_i - \mathrm{Mean}_i)/\mathrm{Sigma}_i$, to the factors. For example, the first factor, Factor1, is a contrast of the weighted
average of Group A versus the weighted average of Group B, the weight of each variable given
by the corresponding element of the first eigenvector. Mathematically, the relationship is given
by:
\[ \mathrm{Factor1} = v_{11} x_{11} + v_{12} x_{12} + \cdots + v_{1p} x_{1p} \]
These coefficients may be used to determine the relative importance of each variable in forming
the factor. Often, the eigenvectors are scaled so that the variances of the factor scores are equal to
one. These scaled eigenvectors are given in the Score Coefficients section described later.

Bar Chart of Absolute Eigenvectors


This chart graphically displays the absolute values of the eigenvectors. It allows the viewer to
quickly interpret the eigenvector structure. By looking at which variables correlate highly with a
factor, you can determine what underlying structure it might represent.

Factor Loadings
Factor Loadings
Factors
Gene Factor1 Factor2 Factor3
31962_at -0.949677 -0.290314 -0.084974
37001_at -0.966765 0.167001 -0.037362
37029_at 0.978675 -0.155459 -0.034848
37725_at 0.985690 0.032693 0.132807
38730_at 0.997126 0.003946 -0.007311
39425_at -0.994688 -0.022423 0.021884
40515_at -0.992656 0.012246 -0.085957
100084_at 0.990251 -0.070176 -0.012056
101482_at 0.963793 0.134200 -0.225565
94766_at -0.992888 0.068755 0.040843

Bar Chart of Absolute Factor Loadings


Factors
Gene Factor1 Factor2 Factor3
31962_at ||||||||||||||||||| |||||| ||
37001_at |||||||||||||||||||| |||| |
37029_at |||||||||||||||||||| |||| |
37725_at |||||||||||||||||||| | |||
38730_at |||||||||||||||||||| | |
39425_at |||||||||||||||||||| | |
40515_at |||||||||||||||||||| | ||
100084_at |||||||||||||||||||| || |
101482_at |||||||||||||||||||| ||| |||||
94766_at |||||||||||||||||||| || |

Factor Loadings
These are the correlations between the variables (genes) and factors.
Bar Chart of Absolute Factor Loadings
This chart graphically displays the absolute values of the factor loadings. It allows the viewer to
quickly interpret the correlation structure. By looking at which variables (genes) correlate highly
with a factor, the underlying structure it might represent can be determined.
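
When the analysis is run on the correlation matrix, each loading equals the corresponding eigenvector element multiplied by the square root of that factor's eigenvalue. The sketch below checks this standard relationship against the values reported for gene 31962_at in this example. The variable names are assumptions made for the illustration.

```python
import numpy as np

# Values copied from the Eigenvalues, Eigenvectors, and Factor Loadings reports.
eigenvalues = np.array([9.630263, 0.165738, 0.088083])
eigenvector_row = np.array([-0.306025, -0.713111, -0.286313])   # 31962_at
reported_loadings = np.array([-0.949677, -0.290314, -0.084974])  # 31962_at

# Loading = eigenvector element x sqrt(eigenvalue), factor by factor.
loadings = eigenvector_row * np.sqrt(eigenvalues)
```

The computed loadings agree with the reported ones to within the rounding of the printed values.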
Interpretation of the Example
We now go through the interpretation of each factor. Factor one appears to be the contrast of the
genes of the two groups (as expected). Although Factor2 is probably not very useful, it may be
interpreted as a contrast of 31962_at + 37029_at versus 37001_at + 101482_at. Factor3 has heavy
weight with 101482_at. If the functions of the genes were known, the viewer could try to attach
meaning to these patterns.

Communalities
Communalities
Factors
Gene Factor1 Factor2 Factor3 Communality
31962_at 0.901887 0.084282 0.007221 0.993389
37001_at 0.934635 0.027889 0.001396 0.963920
37029_at 0.957804 0.024168 0.001214 0.983186
37725_at 0.971585 0.001069 0.017638 0.990291
38730_at 0.994259 0.000016 0.000053 0.994328
39425_at 0.989405 0.000503 0.000479 0.990387
40515_at 0.985365 0.000150 0.007389 0.992904
100084_at 0.980598 0.004925 0.000145 0.985668
101482_at 0.928897 0.018010 0.050880 0.997787
94766_at 0.985827 0.004727 0.001668 0.992223

Bar Chart of Communalities


Factors
Gene Factor1 Factor2 Factor3 Communality
31962_at ||||||||||||||||||| || | ||||||||||||||||||||
37001_at ||||||||||||||||||| | | ||||||||||||||||||||
37029_at |||||||||||||||||||| | | ||||||||||||||||||||
37725_at |||||||||||||||||||| | | ||||||||||||||||||||
38730_at |||||||||||||||||||| | | ||||||||||||||||||||
39425_at |||||||||||||||||||| | | ||||||||||||||||||||
40515_at |||||||||||||||||||| | | ||||||||||||||||||||
100084_at |||||||||||||||||||| | | ||||||||||||||||||||
101482_at ||||||||||||||||||| | || ||||||||||||||||||||
94766_at |||||||||||||||||||| | | ||||||||||||||||||||

Communality
The communality is the proportion of the variation of a variable that is accounted for by the
factors that are retained. It is the R² value that would be achieved if this variable were regressed
on the retained factors. This table value gives the amount added to the communality by each
factor.
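
Each entry of the table is the square of the corresponding factor loading, and the communality is their sum across the retained factors. The following sketch reproduces two rows of the table from the reported loadings. The array names are assumptions for the illustration.

```python
import numpy as np

# Factor loadings copied from the Factor Loadings report.
loadings = np.array([
    [-0.949677, -0.290314, -0.084974],  # 31962_at
    [0.997126, 0.003946, -0.007311],    # 38730_at
])

# Amount added to the communality by each factor.
contributions = loadings**2
# Communality: total over the three retained factors.
communality = contributions.sum(axis=1)

reported_communality = np.array([0.993389, 0.994328])
```

Small differences in the last printed digit come from rounding of the loadings in the report.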
Bar Chart of Communalities
This chart graphically displays the values of the communalities.

Factor Structure Summary


Factor Structure Summary
Factors
Factor1 Factor2 Factor3
38730_at
39425_at
94766_at
40515_at
100084_at
37725_at
37029_at
37001_at
101482_at
31962_at

Interpretation
This report is provided to summarize the factor structure. Variables with an absolute loading
greater than the amount set in the Minimum Loading option are listed under each factor. Using
this report, you can quickly see which variables are related to each factor. Notice that it is
possible for a variable to have high loadings on several factors. In this example, all the important
loadings are associated with Factor1.

Score Coefficients
Score Coefficients
Factors
Gene Factor1 Factor2 Factor3
31962_at -9.861384E-02 -1.751647 -0.9647075
37001_at -0.1003882 1.007622 -0.4241694
37029_at 0.1016249 -0.9379836 -0.3956228
37725_at 0.1023534 0.1972568 1.507755
38730_at 0.1035408 2.380825E-02 -8.300588E-02
39425_at -0.1032878 -0.1352902 0.2484457
40515_at -0.1030767 7.388766E-02 -0.9758725
100084_at 0.102827 -0.4234185 -0.1368682
101482_at 0.1000796 0.8097138 -2.560834
94766_at -0.1031009 0.4148414 0.4636864

Score Coefficients
These are the coefficients that are used to form the factor scores. The factor scores are the values
of the factors for a particular row of data. These score coefficients are similar to the eigenvectors.
They have been scaled so that the scores produced have a variance of one rather than a variance
equal to the eigenvalue. This causes each of the factors to have the same variance.
These coefficients would be used to calculate the factor scores for new rows not included in the
original analysis.
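The scaling described above can be illustrated with a minimal NumPy sketch. It assumes an unrotated analysis of the correlation matrix and uses synthetic data; it is not the GESS implementation, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 50 rows (arrays) by 4 variables (genes); purely illustrative.
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]  # introduce some correlation between two variables

# Standardize the columns, since this sketch analyzes the correlation matrix.
means, sds = X.mean(axis=0), X.std(axis=0, ddof=1)
Z = (X - means) / sds
R = np.corrcoef(X, rowvar=False)

# Eigen-decomposition, sorted so the largest eigenvalue comes first.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2  # number of factors kept (arbitrary choice for the sketch)
# Score coefficients: each eigenvector divided by the square root of its eigenvalue.
score_coefficients = eigenvectors[:, :k] / np.sqrt(eigenvalues[:k])

# Factor scores for the original rows; each column has variance one.
scores = Z @ score_coefficients
print(scores.var(axis=0, ddof=1))

# Scoring a new row: standardize it with the ORIGINAL means and standard deviations.
new_row = rng.normal(size=4)
new_scores = ((new_row - means) / sds) @ score_coefficients
```

Dividing by the square root of the eigenvalue is what turns a component with variance equal to the eigenvalue into a score with variance one.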

Residual Section
Residual Section
Row T2 T2 Prob Q0 Q1 Q2 Q3
1 8.10 1.0000 10.26 0.39 0.20 0.19
2 8.10 1.0000 8.83 0.12 0.10 0.08
3 8.10 1.0000 8.00 0.30 0.30 0.14
4 8.10 1.0000 9.19 0.33 0.27 0.06
5 8.10 1.0000 8.53 0.40 0.14 0.14
6 8.10 1.0000 8.17 0.88 0.12 0.12
7 8.10 1.0000 7.62 0.24 0.18 0.05
8 8.10 1.0000 10.16 0.11 0.09 0.09
9 8.10 1.0000 10.60 0.34 0.33 0.11
10 8.10 1.0000 8.64 0.21 0.11 0.07

This report is useful for detecting outliers – observations that are very different from the bulk of
the data. To do this, two quantities are displayed: T² and Qk. We will now define these two
quantities.
T² measures the combined variability of all the variables in a single observation. Mathematically,
T² is defined as:
$$T^2 = (x - \bar{x})' \, S^{-1} \, (x - \bar{x})$$
where $x$ represents a p-variable observation, $\bar{x}$ represents the p-variable mean vector, and $S^{-1}$
represents the inverse of the covariance matrix.
T² is not affected by a change in scale. It is the same whether the analysis is performed on the
covariance or the correlation matrix. T² gives a scaled distance measure of an individual
observation from the overall mean. The closer an observation is to its mean, the smaller will be
the value of T².
If the variables follow a multivariate normal distribution, then the probability distribution of T²
may be related to the common F distribution using the formula:
$$T^2_{p,n,\alpha} = \frac{p(n - 1)}{n - p} \, F_{p,\,n-p,\,\alpha}$$
Using this relationship, we can perform a statistical test at a given level of significance to
determine if the observation is significantly different from the vector of means. You set the α
value using the Alpha option. Since this test is being performed N times, you would anticipate
about Nα observations to be significant by chance variation. Rows that are significant at the
chosen level are starred in the report, and you would probably want to check them for data entry
or transcription errors. In the Residual Section shown above, no rows are starred.
T² is really not part of a normal PCA since it may be calculated independently. It is presented to
help detect observations that may have an undue influence on the analysis. You can read more
about its use and interpretation in Jackson (1991).
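A minimal NumPy sketch of the T² definition is given below, using synthetic data. The function name `hotelling_t2` is illustrative, and the use of a pseudo-inverse is an assumption of this sketch so that it still runs when the covariance matrix is singular; the source does not say how GESS handles that case.

```python
import numpy as np


def hotelling_t2(X):
    """Return T² for every row of X (rows = observations, columns = variables)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Assumption of this sketch: a pseudo-inverse replaces the inverse,
    # so that a singular S (as many variables as rows) does not fail.
    S_inv = np.linalg.pinv(S, rcond=1e-10)
    # Quadratic form (x - mean)' S^-1 (x - mean), evaluated row by row.
    return np.einsum("ij,jk,ik->i", centered, S_inv, centered)


rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))  # synthetic data: 30 rows, 3 variables
t2 = hotelling_t2(X)

# Rescaling the variables leaves T² unchanged.
t2_rescaled = hotelling_t2(X * np.array([1.0, 10.0, 1000.0]))

# With 10 rows and 10 variables, S is singular and every row gets (n - 1)^2 / n.
t2_small = hotelling_t2(rng.normal(size=(10, 10)))
print(t2_small.round(2))
```

With n = 10 rows and at least n - 1 variables, the pseudo-inverse version gives (n - 1)²/n = 8.10 for every row. This is consistent with the constant T2 column in the Residual Section above, although that connection is an inference and not something the source states.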
The other quantity shown on this report is Qk. Qk represents the sum of squared residuals when an
observation is predicted using the first k factors. Mathematically, the formula for Qk is:
$$Q_k = (x - \hat{x})'(x - \hat{x}) = \sum_{i=1}^{p} \left( x_i - {}_k\hat{x}_i \right)^2 = \sum_{i=k+1}^{p} \lambda_i \, (pc_i)^2$$

Here ${}_k\hat{x}_i$ refers to the value of variable i predicted from the first k factors, $\lambda_i$ refers to the ith
eigenvalue, and $pc_i$ is the score of the ith factor for this particular observation. Further details are
given in Jackson (1991) on pages 36 and 37.
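The two forms of Qk (the residual sum of squares and the eigenvalue-weighted sum of squared scores) can be checked against each other numerically. The sketch below uses synthetic data and assumes a correlation-matrix analysis with unit-variance scores; the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))  # synthetic data: 40 rows, 5 variables
X[:, 1] += X[:, 0]            # introduce some correlation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, largest eigenvalue first.
eigenvalues, V = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# Unit-variance scores for every factor (the pc_i of the formula).
pc = (Z @ V) / np.sqrt(eigenvalues)


def q_residual(k):
    """Q_k as the sum of squared residuals after predicting Z from the first k factors."""
    Z_hat = (Z @ V[:, :k]) @ V[:, :k].T
    return ((Z - Z_hat) ** 2).sum(axis=1)


def q_from_scores(k):
    """Q_k as the sum over the discarded factors of eigenvalue times squared score."""
    return (eigenvalues[k:] * pc[:, k:] ** 2).sum(axis=1)


for k in range(5):
    print(k, np.allclose(q_residual(k), q_from_scores(k)))
```

With k = 0 nothing is predicted, so Q0 is simply the sum of squared standardized values for the row, and Qk can only shrink as more factors are kept.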
An upper limit for Qk is given by the formula:
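$$Q_\alpha = a \left[ \frac{z_\alpha \sqrt{2 b h^2}}{a} + \frac{b h (h - 1)}{a^2} + 1 \right]^{1/h}$$
where
$$a = \sum_{i=k+1}^{p} \lambda_i, \qquad b = \sum_{i=k+1}^{p} \lambda_i^2, \qquad c = \sum_{i=k+1}^{p} \lambda_i^3, \qquad h = 1 - \frac{2ac}{3b^2}$$
and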
zα is the upper normal deviate of area α if h is positive or the lower normal deviate of area α if h
is negative.
This limit is valid for any value of k, whether too many or too few factors are kept. Note that
these formulas are for the case when the correlation matrix is being used. When the analysis is
being run on the covariance matrix, the pci’s must be adjusted. Further details are given in
Jackson (1991).
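The upper limit can be evaluated directly from the discarded eigenvalues. The sketch below follows the formula as written above for the correlation-matrix case; the eigenvalues are invented for illustration, `q_upper_limit` is a hypothetical helper name, and degenerate cases such as h = 0 are not handled.

```python
import math
from statistics import NormalDist


def q_upper_limit(eigenvalues, k, alpha=0.05):
    """Upper limit for Q_k from the eigenvalues of the discarded factors (k+1..p)."""
    tail = eigenvalues[k:]
    a = sum(tail)
    b = sum(lam ** 2 for lam in tail)
    c = sum(lam ** 3 for lam in tail)
    h = 1 - (2 * a * c) / (3 * b ** 2)
    # Upper normal deviate of area alpha if h is positive, lower deviate if negative.
    if h > 0:
        z = NormalDist().inv_cdf(1 - alpha)
    else:
        z = NormalDist().inv_cdf(alpha)
    inner = z * math.sqrt(2 * b * h ** 2) / a + b * h * (h - 1) / a ** 2 + 1
    return a * inner ** (1 / h)


# Hypothetical eigenvalues of a 5-variable correlation matrix (they sum to 5).
eigenvalues = [2.5, 1.0, 0.8, 0.4, 0.3]
limit_05 = q_upper_limit(eigenvalues, k=2, alpha=0.05)
limit_01 = q_upper_limit(eigenvalues, k=2, alpha=0.01)
print(limit_05, limit_01)
```

A row whose Q2 exceeds `limit_05` would be flagged at the 0.05 level in this sketch; a smaller α gives a larger limit and therefore fewer flagged rows.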
Notice that significant (starred) values of Qk indicate observations that are not duplicated well by
the first k factors. These should be checked to see if they are valid. Qk and T² provide an initial
data screening tool.
Interpretation of the Example
None of the rows shows any sign of being an outlier. The T2 Probs are all 1.

Factor Score
Factor Score
Factors
Row Factor1 Factor2 Factor3
1 -1.0120 1.0718 0.3154
2 -0.9514 -0.2871 -0.4970
3 -0.8938 -0.1370 1.3564
4 -0.9587 0.6131 -1.5454
5 -0.9190 -1.2364 0.2941
6 0.8705 -2.1380 -0.1182
7 0.8756 0.6267 1.2020
8 1.0215 0.3782 -0.0955
9 1.0323 0.3053 -1.5639
10 0.9351 0.8035 0.6521

This report presents the individual factor scores scaled so each column has a mean of zero and a
standard deviation of one. These are the values that are plotted in the plots that follow. There is
one row of score values for each observation and one column for each factor that was kept.

Factor Score Plots

[Figure: three Factor Scores scatter plots (Score1 vs. Score2, Score1 vs. Score3, and Score2 vs. Score3), with each point labeled by its row number.]

This set of plots shows each factor plotted against every other factor. The first k factors (where k
is the number of large eigenvalues) usually show the major structure that will be found in the
data. The rest of the factors show outliers and linear dependencies. The plotted numbers are the
row numbers of the individuals.

Factor Loading Plots

[Figure: three Factor Loadings scatter plots (Loading1 vs. Loading2, Loading1 vs. Loading3, and Loading2 vs. Loading3), with each point labeled by its gene number.]

Discussion of Factor Loading Plots


This set of plots shows each of the factor loading columns plotted against each other. The data
points represent genes. The numbers correspond to the alpha-numeric ordered genes (the order is
shown in the Score Coefficients section). The plot allows you to find genes that are highly
correlated with both factors. It is anticipated that this will aid in the interpretation of the factors.
In this example, genes 3 (37029_at), 4 (37725_at), 5 (38730_at), 8 (100084_at), and 9
(101482_at) have means that are large in Group A and small in Group B. Genes 1 (31962_at), 2
(37001_at), 6 (39425_at), 7 (40515_at), and 10 (94766_at) have means that are large in Group B
and small in Group A.
