
Principal Component Analysis
An introduction to dimensionality reduction.

-Sahil Imani
Some prerequisites before getting into PCA
 Origins of PCA
 Importance of variance in data and information entropy
 What do we mean by dimensions?
 Why do we need to reduce dimensions?
 The logic behind PCA and a visual explanation
PCA: Origins
 Comes from statistics, as part of factor analysis and dimensionality reduction (feature extraction).
 Is NOT a machine learning technique by itself.
 The goal of data analysis is generally to make "sense" of the data.
 It is done in three iterative steps (Clean, Reduce, Transform), repeated until we reach an acceptable level.
Video taken from the Computerphile YouTube channel.
Importance of variance in data and information entropy
 Information entropy essentially tells us the rate of information generation from a stochastic process.
 It gives us a relation between information gain and uncertainty.
 The greater the uncertainty, the more information is transferred/gained when the outcome is observed (see the formula sketch below).
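As a minimal sketch, here is the standard Shannon entropy formula (stated for completeness, not from the original slides): for a discrete random variable X with outcome probabilities p(x),

$$
H(X) = -\sum_{x} p(x)\,\log_2 p(x)
$$

A fair coin (p = 0.5 for each side) has H = 1 bit, the maximum for two outcomes; a heavily biased coin has lower entropy because its outcome is less uncertain.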
Dimensionality
 In data analysis, the number of attributes or features that determine the final output of a data-driven decision is known as its dimensionality.
 The more attributes we use to describe something, the more "dimensions" it has.
 For more than three dimensions, however, it becomes impossible to visualize the data on a 2D plane, which is why we need to reduce/project it to a lower dimension while retaining most of the information.
Need to reduce dimensionality
 Helps with data visualization.
 Makes calculations faster; the subsequent machine learning stage needs less data to work with for the same amount of information.
 Reduces the data set so we can start drawing conclusions.
 Optimizes the data for use in actual machine learning or statistical modelling.
PCA: The logic behind it and a visual explanation
 There are common examples of dimensionality reduction in everyday life.
 Some dimensions/factors contain much more information than others.
 If we can find the principal, or "important", dimensions, we can discard the ones that don't contribute much, as well as some highly correlated dimensions.
 This is the logical basis for PCA.
 Visually (in two dimensions), we can see it as trying to fit a line along the direction of maximum variance.
 That line will be a linear combination of both dimensions (made precise in "The Math Powering It" below).
A 2D visualization of a data
set having two attributes.
The Math Powering It
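A minimal sketch of the standard formulation, in generic notation (not taken from the original slides): given a mean-centered data matrix $X \in \mathbb{R}^{n \times d}$ (n samples, d attributes), the covariance matrix is

$$
C = \frac{1}{n-1} X^\top X
$$

The first principal component is the unit vector that maximizes the variance of the projected data,

$$
w_1 = \arg\max_{\|w\| = 1} \; w^\top C w,
$$

which turns out to be the eigenvector of $C$ with the largest eigenvalue. The remaining components are the other eigenvectors, ordered by decreasing eigenvalue, and each eigenvalue equals the variance captured along its component.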
Programming Implementation
 The basic flow is as follows (a NumPy sketch appears after this list):
 Find the eigenvalues and eigenvectors of the covariance matrix of the attributes.
 Sort the eigenvectors by their eigenvalues, from largest to smallest.
 Discard trailing principal components as long as we stay within the amount of information (variance) we need to retain.
 Reproject the data using the reduced dimensions.
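A minimal NumPy sketch of this flow (the function name, variable names, and the 95% variance threshold are illustrative choices, not from the original slides):

```python
import numpy as np

def pca(X, var_to_keep=0.95):
    """Reduce X (n_samples x n_features), keeping var_to_keep of the variance."""
    # Center the data so the covariance matrix is meaningful.
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the attributes (features).
    cov = np.cov(X_centered, rowvar=False)

    # Eigenvalues/eigenvectors; eigh suits symmetric matrices like cov.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Sort from largest to smallest eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep just enough components to retain the requested variance.
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = np.searchsorted(explained, var_to_keep) + 1

    # Reproject the data onto the top-k principal components.
    return X_centered @ eigvecs[:, :k]

# Example: 200 strongly correlated 2D points collapse to 1 dimension.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])
print(pca(X).shape)  # (200, 1) -- one component captures nearly all the variance
```

In the example, the two attributes are almost perfectly correlated, so the first eigenvector alone exceeds the 95% variance threshold and the second component is discarded.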
Thank You
