
Principal Component Analysis

• A method of extracting the important variables (in the form of components) from a large set of variables available in a data set.
• Brings out strong patterns in large and complex datasets.
• Always performed on a symmetric correlation or covariance matrix, which means the data must be numeric and standardized.
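For intuition, here is a minimal NumPy sketch of that matrix view of PCA: diagonalizing the covariance matrix of standardized data gives the principal directions (eigenvectors) and the variance along each of them (eigenvalues). The toy data and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy data: 100 rows, 3 features

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize (Z-scores)
cov = np.cov(X_std, rowvar=False)              # symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)         # eigh handles symmetric matrices
order = np.argsort(eigvals)[::-1]              # sort PCs by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_std @ eigvecs                       # projections = PC scores
print(eigvals / eigvals.sum())                 # fraction of variance per PC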

Steps:

1. Data pre-processing (excluding features based on business acumen, treating missing/garbage values, etc.)
2. If any categorical variables are present, convert them to numerical features using one-hot encoding
(Refer: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ )
3. Divide the data into train and test datasets (80-20)
4. Scale and center the training data: normalize based on the mean and standard deviation, i.e., calculate Z-scores (a code sketch follows the note below)

Why is Normalization required?

• PCA is fed the normalized versions of the original predictors because the original predictors may be on very different scales. For example, imagine a data set whose variables are measured in gallons, kilometres, light years, etc.; the variances of these variables are bound to differ enormously.
• Performing PCA on un-normalized variables produces very large loadings for the variables with high variance, so the leading principal components end up depending mostly on those variables. This is undesirable.
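A minimal sketch of steps 3-4 with scikit-learn; X and y are assumed placeholders for the pre-processed features and target from steps 1-2.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)       # 80-20 split (step 3)

scaler = StandardScaler().fit(X_train)          # mean and std from TRAIN data only
X_train_std = scaler.transform(X_train)         # Z-scores for the training data
X_test_std = scaler.transform(X_test)           # same mean/std reused later (step 8)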
5. Principal Components:
a. Principal components (PCs) are drawn such that the perpendicular projections of the original data points onto the PC are as spread out as possible, while the perpendicular distance of each projection from its original data point is as small as possible
b. In simple words, the proxy of each data point on the PC stays as close to the original data point as possible, while all the proxies on the PC are as distant from each other as possible
c. Once the first PC, which explains the maximum variance, is drawn, the second PC is drawn perpendicular to it so that it explains the next largest amount of variance
E.g., if PC1 explains 85% of the variance, PC2 might explain 10% of the variance left unexplained by PC1, PC3 perhaps 2%, and so on.
d. If the majority of the variance (usually >95%) is explained by 2 or 3 PCs, then PCA makes sense; use the scree plot (a.k.a. "elbow curve") to decide how many PCs to keep for your model (see the sketch after this list)
e. After drawing the PCs, we can rotate them to look like the usual X-Y axes for simplicity
f. The transformation matrix does all of this for you in a single line of code
6. Once PCA is done, you can see the inherent clusters in the data in the PCA score plot
7. The transformed data can now be used to build a model with the 2/3 PCs as the new "features"
8. The test data used for validation must be normalized with the same mean and standard deviation computed in step 4, and then transformed with the same PCs
9. After normalizing and transforming the test data, use it to evaluate the model
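A sketch of steps 5-9, continuing from the scaling snippet above (X_train_std, X_test_std, y_train, y_test). The choice of k = 2 components and of a linear regression model are illustrative assumptions, not prescribed by these notes.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

pca = PCA().fit(X_train_std)                    # PCs are fit on training data only

# Scree plot ("elbow curve"): explained variance per PC (step 5d)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()

k = 2                                           # chosen from the scree plot (assumption)
Z_train = pca.transform(X_train_std)[:, :k]     # PC scores = new features (step 7)
Z_test = pca.transform(X_test_std)[:, :k]       # same rotation applied to test data (step 8)

model = LinearRegression().fit(Z_train, y_train)    # illustrative model choice
print("Test R^2:", model.score(Z_test, y_test))     # evaluation on test data (step 9)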

Reference: https://blog.bioturing.com/2018/06/18/how-to-read-pca-biplots-and-scree-plots/

Find the code here: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

Homework:

Cereals dataset >> exclude ‘shelf’ and ‘cereal name’ >> ‘rating’ is ‘y’ >> do a train-test split >> using the train data, create PCs >> link them to the outcome ‘y’ >> use this model to predict values for the test dataset (a hedged sketch follows)
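A possible starting point for the homework. The file name cereals.csv and the exact column labels ('shelf', 'cereal name', 'rating') are assumptions that may need adjusting to the actual dataset; linear regression is likewise an illustrative model choice.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cereals.csv")                       # assumed file name
df = df.drop(columns=["shelf", "cereal name"])        # exclude per the instructions
X, y = df.drop(columns=["rating"]), df["rating"]      # assumed target column label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)                   # train statistics only
pca = PCA(n_components=3).fit(scaler.transform(X_tr)) # PCs from the train data

Z_tr = pca.transform(scaler.transform(X_tr))          # train features in PC space
Z_te = pca.transform(scaler.transform(X_te))          # test data: same scaler and PCs

model = LinearRegression().fit(Z_tr, y_tr)            # link PCs to the outcome y
print(model.predict(Z_te))                            # predictions for the test dataset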
