
Principal Component Analysis

Background

Section 10.2 of Introduction to Statistical Learning
By Gareth James, et al.
[Slide diagram: DS surrounded by Machine Learning, Math & Statistics, Software, Research, and Domain Knowledge]

● Let’s discuss the basic idea behind principal component analysis (PCA).
● It is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables.
● It is also sometimes known as a general factor analysis.

● Where regression determines a line of best fit to a data set, factor analysis determines several orthogonal lines of best fit to the data set.
● Orthogonal means “at right angles”.
○ Actually the lines are perpendicular to each other in n-dimensional space.
● n-dimensional space is the variable sample space.
○ There are as many dimensions as there are variables, so in a data set with 4 variables the sample space is 4-dimensional.
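
To make "perpendicular in n-dimensional space" concrete, here is a minimal sketch with 4 variables. The data is randomly generated for illustration only, and taking the directions from the eigenvectors of the covariance matrix is an assumption about how the lines are computed; the slides do not specify a method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: 200 observations of 4 variables.
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # make two of the variables correlated

# Eigenvectors of the covariance matrix, used here as the directions.
cov = np.cov(X, rowvar=False)  # shape (4, 4)
eigenvalues, directions = np.linalg.eigh(cov)

# Each column of `directions` is a unit vector in 4-dimensional space.
# Orthogonal means every pair of distinct columns has dot product 0,
# so this matrix of pairwise dot products should be close to the identity.
gram = directions.T @ directions
print(np.round(gram, 6))
```

With 4 variables there are 4 such directions, one per dimension of the sample space.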
● Here we have some data plotted along two features, x and y.
● We can add an orthogonal line.
● Now we can begin to understand the components!
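
The original plot is not reproduced here, so the following is a rough sketch of the same idea on made-up two-feature data: find the direction of greatest spread, then the line at right angles to it. The data, the seed, and the variable names `first` and `second` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-feature data: y roughly follows x, plus noise.
x = rng.normal(size=300)
y = 0.6 * x + rng.normal(scale=0.3, size=300)
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)

# eigh returns eigenvalues in ascending order, so the last column
# is the direction of greatest variance.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
first = eigenvectors[:, 1]   # direction of the first line
second = eigenvectors[:, 0]  # direction of the orthogonal line

print("first direction: ", first)
print("second direction:", second)
print("dot product:", first @ second)  # approximately 0
```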
● Components are a linear transformation that chooses a variable system for the data set such that the greatest variance of the data set comes to lie on the first axis.
● The second greatest variance comes to lie on the second axis, and so on.
● This process allows us to reduce the number of variables used in an analysis.
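
The ordering of variance across axes can be sketched as follows. This is a minimal illustration on synthetic data, assuming the components are the covariance eigenvectors sorted by eigenvalue; names such as `scores` and `reduced` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 500 observations of 3 variables with unequal spread.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])
X[:, 2] += 0.5 * X[:, 0]
centered = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]  # largest variance first
components = eigenvectors[:, order]

# Linear transformation into the new variable system.
scores = centered @ components
variances = scores.var(axis=0, ddof=1)
print(variances)  # decreasing from the first axis to the last

# Reducing the number of variables: keep only the first two axes.
reduced = scores[:, :2]
print(reduced.shape)  # (500, 2)
```

Dropping the last axis discards the direction along which the data varies least, which is why the reduction loses comparatively little information.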
● Note that components are uncorrelated, since in the sample space they are orthogonal to each other.
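
A quick numerical check of this claim, as a sketch on invented data: two strongly correlated variables are projected onto component directions (assumed here to be the eigenvectors of the covariance matrix), and the correlation between the resulting components is compared with the correlation between the original variables.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: two strongly correlated variables.
a = rng.normal(size=400)
X = np.column_stack([a, a + rng.normal(scale=0.5, size=400)])
centered = X - X.mean(axis=0)

# Project the data onto the component directions.
_, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
scores = centered @ eigenvectors

original_corr = np.corrcoef(X, rowvar=False)[0, 1]
component_corr = np.corrcoef(scores, rowvar=False)[0, 1]
print(f"correlation between original variables: {original_corr:.3f}")
print(f"correlation between components:         {component_corr:.3f}")
```

The second value should be zero up to floating-point error, while the first stays well above zero.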
● We can continue this analysis into higher dimensions.
● If we use this technique on a data set with a large number of variables, we can compress the amount of explained variation to just a few components.
● The most challenging part of PCA is interpreting the components.
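
As an illustration of that compression, the sketch below builds a synthetic data set of 20 variables that are secretly driven by 3 hidden factors, then measures how much of the total variation the leading components explain. The data, the number of factors, and the noise level are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 20 observed variables driven by 3 hidden factors plus noise.
factors = rng.normal(size=(1000, 3))
loadings = rng.normal(size=(3, 20))
X = factors @ loadings + rng.normal(scale=0.1, size=(1000, 20))
centered = X - X.mean(axis=0)

# Squared singular values of the centered data are proportional to
# the variance along each component.
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Cumulative share of variation explained by the first 5 components.
print(np.round(cumulative[:5], 3))
```

In this constructed case nearly all of the variation sits in the first three components. Real data rarely separates this cleanly, which is part of why interpreting the components is hard.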
● For our work with Python, we’ll walk through an example of how to perform PCA with scikit-learn.
● We usually want to standardize our data by some scale for PCA, so we’ll cover how to do this as well.
● Since this algorithm is usually used for analysis of data and not as a fully deployable model, there won’t be a portfolio project for this topic.

Example with Python
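
The slides do not include the example itself, so what follows is only a minimal sketch of the two steps mentioned above: standardizing the data, then running PCA with scikit-learn. The data is randomly generated, and the choice of `StandardScaler` and of 2 components is an assumption made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Hypothetical data: 4 variables measured on very different scales.
X = rng.normal(size=(150, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# Standardize each variable to mean 0 and unit variance before PCA,
# so that no variable dominates simply because of its units.
scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components (an arbitrary choice here).
pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled)

print(scaled.shape, "->", x_pca.shape)
print(pca.explained_variance_ratio_)
```

After fitting, `pca.components_` holds one row per component and one column per original variable, which is the usual starting point for interpreting what each component represents.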
