http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/
23/03/2016

Introduction
Too much of anything is good for nothing!
What happens when a data set has too many variables? Here are a few possible situations
which you might come across:
1. You find that most of the variables are correlated.
2. You lose patience and decide to run a model on the whole data. This returns poor accuracy
and you feel terrible.
3. You become indecisive about what to do.
4. You start thinking of some strategic method to find a few important variables.
Trust me, dealing with such situations isn't as difficult as it sounds. Statistical techniques
such as factor analysis and principal component analysis help to overcome such difficulties.
In this post, I've explained the concept of principal component analysis in detail. I've kept
the explanation simple and informative. For practical understanding, I've also
demonstrated using this technique in R with interpretations.
The first principal component can be written as

Z₁ = φ₁₁X₁ + φ₂₁X₂ + … + φₚ₁Xₚ

where,
Z₁ is the first principal component
φ₁ is the loading vector comprising of the loadings (φ₁₁, φ₂₁, …, φₚ₁) of the first principal
component. The loadings are constrained so that their sum of squares is equal to one.
Therefore,
The first principal component is a linear combination of the original predictor variables which
captures the maximum variance in the data set. It determines the direction of highest
variability in the data. The larger the variability captured by the first component, the larger
the information captured by that component. No other component can have variability higher than
the first principal component.
The first principal component results in a line which is closest to the data, i.e. it minimizes the
sum of squared distances between the data points and the line.
Similarly, we can compute the second principal component also.
If the two components are uncorrelated, their directions should be orthogonal (image
below). This image is based on simulated data with 2 predictors. Notice the direction of the
components; as expected, they are orthogonal. This suggests the correlation between these
components is zero.
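This can be checked numerically. Here is a minimal NumPy sketch on simulated data with 2 predictors (made-up numbers, not the article's data set): the principal directions are eigenvectors of the covariance matrix, so they come out orthogonal and the component scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated data set with 2 correlated predictors (illustrative only)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)   # center, as PCA requires

# principal directions are eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending order
pc1, pc2 = eigvecs[:, -1], eigvecs[:, -2]

# the directions are orthogonal ...
print(np.dot(pc1, pc2))
# ... so the component scores are uncorrelated
scores = X @ eigvecs[:, ::-1]
print(np.corrcoef(scores, rowvar=False)[0, 1])
```

Both printed values are zero up to floating-point noise, which is exactly what the image conveys.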
All succeeding principal components follow a similar concept, i.e. they capture the
remaining variation without being correlated with the previous components. In general, for
an n × p data set, min(n − 1, p) distinct principal components can be constructed.
Note: Partial least squares (PLS) is a supervised alternative to PCA. PLS assigns higher weight
to variables which are strongly related to the response variable when determining the principal
components.
Performing PCA on un-normalized variables will lead to insanely large loadings for variables
with high variance. In turn, this will lead to dependence of a principal component on the
variable with high variance. This is undesirable.
As shown in the image below, PCA was run on a data set twice (with unscaled and scaled
predictors). This data set has ~40 variables. You can see that the first principal component is
dominated by the variable Item_MRP, and the second principal component is dominated by the
variable Item_Weight. This domination prevails due to the high variance associated
with these variables. When the variables are scaled, we get a much better representation of the
variables in 2D space.
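The effect is easy to reproduce on simulated data. In this sketch the two columns are hypothetical stand-ins for Item_MRP and Item_Weight (the numbers are made up): without scaling, the first loading vector is dominated by the high-variance column; after scaling, both variables contribute comparably.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
# hypothetical stand-ins: a high-variance and a low-variance predictor
mrp = rng.normal(140, 60, 1000)
weight = 12 + 0.02 * mrp + rng.normal(0, 3, 1000)   # mildly correlated
X = np.column_stack([mrp, weight])

# unscaled: the PC1 loading vector is dominated by the high-variance column
pc1_raw = PCA().fit(X).components_[0]
# scaled: both variables contribute comparably
pc1_std = PCA().fit(scale(X)).components_[0]

print(np.abs(pc1_raw))   # first entry close to 1, second close to 0
print(np.abs(pc1_std))   # entries of similar magnitude
```

This is the same behavior the two biplots in the image illustrate on the ~40-variable data set.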
Remember, PCA can be applied only on numerical data. Therefore, if the data has
categorical variables, they must be converted to numerical. Also, make sure you have done
the basic data cleaning prior to implementing this technique. Let's quickly finish the initial
data loading and cleaning steps:
# directory path
> path <- ".../Data/Big_Mart_Sales"

# set working directory
> setwd(path)

# load train and test file
> train <- read.csv("train_Big.csv")
> test <- read.csv("test_Big.csv")

# add a column
> test$Item_Outlet_Sales <- 1

# combine the data set
> combi <- rbind(train, test)

# impute missing values with median
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

# impute 0 with median
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility), combi$Item_Visibility)

# find mode and impute
> table(combi$Outlet_Size, combi$Outlet_Type)
> levels(combi$Outlet_Size)[1] <- "Other"
Till here, we've imputed missing values. Now we are left with removing the dependent
(response) variable and other identifier variables (if any). As we said above, we are practicing
an unsupervised learning technique, hence the response variable must be removed.
# remove the dependent and identifier variables
> my_data <- subset(combi, select = -c(Item_Outlet_Sales, Item_Identifier, Outlet_Identifier))
Let's check the available variables (a.k.a. predictors) in the data set.

# check available variables
> colnames(my_data)

Since PCA works on numeric variables, let's see if we have any variable other than numeric.

# check variable class
> str(my_data)
'data.frame': 14204 obs. of 9 variables:
 $ Item_Weight              : num 9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
 $ Item_Visibility          : num 0.016 0.0193 0.0168 0.054 0.054 ...
 $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
 $ Item_MRP                 : num 249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ..
 $ Outlet_Size              : Factor w/ 4 levels "Other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
 $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
 $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do
now. We'll convert these categorical variables into numeric using one hot encoding.

# load library
> library(dummies)

# create a dummy data frame
> new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content", "Item_Type",
                  "Outlet_Establishment_Year", "Outlet_Size",
                  "Outlet_Location_Type", "Outlet_Type"))
# check the data set
> str(new_my_data)

And, we now have all the numerical values. We can now go ahead with PCA.

The base R function prcomp() is used to perform PCA. By default, it centers the variables to
have mean equal to zero. With the parameter scale. = T, we normalize the variables to have
standard deviation equal to 1.

# principal component analysis
> prin_comp <- prcomp(new_my_data, scale. = T)
> names(prin_comp)
[1] "sdev"     "rotation" "center"   "scale"    "x"
2. The rotation measure provides the principal component loading. Each column of rotation
matrix contains the principal component loading vector. This is the most important measure
we should be interested in.
> prin_comp$rotation

This returns 44 principal component loadings. Is that correct? Absolutely. In a data set, the
maximum number of principal component loading vectors is min(n − 1, p). Let's look at the
first 4 principal components and first 5 rows.
> prin_comp$rotation[1:5, 1:4]
                                 PC1         PC2         PC3         PC4
Item_Weight             0.0054429225 0.001285666 0.011246194 0.011887106
Item_Fat_ContentLF      0.0021983314 0.003768557 0.009790094 0.016789483
Item_Fat_Contentlow fat 0.0019042710 0.001866905 0.003066415 0.018396143
Item_Fat_ContentLow Fat 0.0027936467 0.002234328 0.028309811 0.056822747
Item_Fat_Contentreg     0.0002936319 0.001120931 0.009033254 0.001026615
3. In order to compute the principal component score vectors, we don't need to multiply the
loadings with the data. Rather, the matrix x already contains the principal component score
vectors, in 14204 × 44 dimension.
> dim(prin_comp$x)
[1] 14204 44
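The analogue in scikit-learn is the output of transform(). A small sketch on simulated stand-in data (not the Big Mart set): the score matrix is exactly the centered data multiplied by the loading vectors, which is what the statement above means.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))        # simulated stand-in data

pca = PCA().fit(X)
scores = pca.transform(X)            # analogue of prin_comp$x, here 100 x 5

# the scores are the centered data times the loading vectors
manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(scores, manual))   # True
```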
Let's plot the resultant principal components.

> biplot(prin_comp, scale = 0)
The parameter scale = 0 ensures that the arrows are scaled to represent the loadings. To
make inferences from the image above, focus on the extreme ends (top, bottom, left, right) of
this graph.
We infer that the first principal component corresponds to a measure of one set of variables,
while the second component corresponds to a measure of another set, including
Outlet_Location_TypeTier1.
We aim to find the components which explain the maximum variance. This is because we
want to retain as much information as possible using these components. So, the higher the
explained variance, the higher the information contained in those components.
To compute the proportion of variance explained by each component, we simply divide each
component's variance (pr_var, the square of prin_comp$sdev) by the sum of total variance.
This results in:
# proportion of variance explained
> prop_varex <- pr_var / sum(pr_var)
> prop_varex[1:20]
 [1] 0.10371853 0.07312958 0.06238014 0.05775207 0.04995800 0.04580274
 [7] 0.04391081 0.02856433 0.02735888 0.02654774 0.02559876 0.02556797
[13] 0.02549516 0.02508831 0.02493932 0.02490938 0.02468313 0.02446016
[19] 0.02390367 0.02371118
This shows that the first principal component explains 10.3% of the variance, the second
component explains 7.3%, the third 6.2%, and so on. So, how do we decide how many
components to select for the modeling stage?
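Before answering, it is worth confirming the arithmetic. A NumPy sketch on simulated data (illustrative variances, not the Big Mart set): dividing each component's variance by the total reproduces scikit-learn's explained_variance_ratio_, and the proportions sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# simulated predictors with very different variances (illustrative only)
X = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA().fit(X)
pr_var = pca.explained_variance_           # variance of each component
prop_varex = pr_var / pr_var.sum()         # same formula as the R code above

print(np.allclose(prop_varex, pca.explained_variance_ratio_))  # True
print(round(float(prop_varex.sum()), 10))                      # 1.0
```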
The answer to this question is provided by a scree plot. A scree plot is used to assess the
components or factors which explain most of the variability in the data. It represents the
values in descending order.

# scree plot
> plot(prop_varex, xlab = "Principal Component",
       ylab = "Proportion of Variance Explained",
       type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data set.
In other words, using PCA we have reduced 44 predictors to 30 without compromising on
explained variance. This is the power of PCA. Let's do a confirmation check by plotting a
cumulative variance plot. This will give us a clear picture of the number of components.

# cumulative scree plot
> plot(cumsum(prop_varex), xlab = "Principal Component",
       ylab = "Cumulative Proportion of Variance Explained",
       type = "b")
This plot shows that 30 components result in cumulative variance close to ~98%. Therefore, in
this case, we'll select the number of components as 30 [PC1 to PC30] and proceed to the modeling
stage. This completes the steps to implement PCA in R. For modeling, we'll use these 30
components as predictor variables and follow the normal procedures.
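The selection rule used above, i.e. keep the smallest number of components whose cumulative explained variance crosses a threshold, can be sketched as follows on simulated data (the 98% threshold mirrors the text; everything else is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 20 simulated predictors with decreasing variances (illustrative only)
X = rng.normal(size=(300, 20)) * np.linspace(10, 0.1, 20)

ratio = PCA().fit(X).explained_variance_ratio_
cum = np.cumsum(ratio)

# smallest k whose cumulative explained variance reaches 98%
k = int(np.searchsorted(cum, 0.98) + 1)
print(k, cum[k - 1])
```

The same one-liner applied to the 44-component fit above is what yields k = 30 in the article.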
For Python users: To implement PCA in Python, simply import PCA from the sklearn library. The
interpretation remains the same as explained for R users above. Of course, the result is the same
as derived using R. The data set used for Python is a cleaned version where missing values
have been imputed and categorical variables are converted into numeric.
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline
# Load data set
data = pd.read_csv('Big_Mart_PCA.csv')

# convert it to numpy arrays
X = data.values

# Scaling the values
X = scale(X)

pca = PCA(n_components=44)
pca.fit(X)

# The amount of variance that each PC explains
var = pca.explained_variance_ratio_

# Cumulative variance explained
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4) * 100)
print(var1)
[  10.37   17.68   23.92   29.7    34.7    39.28   43.67   46.53   49.27
   51.92   54.48   57.04   59.59   62.1    64.59   67.08   69.55   72.
   74.39   76.76   79.1    81.44   83.77   86.06   88.33   90.59   92.7
   94.76   96.78   98.44  100.01  100.01  100.01  100.01  100.01  100.01
  100.01  100.01  100.01  100.01  100.01  100.01  100.01  100.01]
plt.plot(var1)
# Looking at above plot I'm taking 30 variables
pca = PCA(n_components=30)
pca.fit(X)
X1 = pca.fit_transform(X)
print(X1)
For more information on PCA in Python, visit the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
Points to Remember
1. PCA is used to extract important features from a data set.
2. These features are low dimensional in nature.
3. These features, a.k.a. components, result from normalized linear combinations of the original
predictor variables.
4. These components aim to capture as much information as possible with high explained
variance.
5. The first component has the highest variance, followed by the second, third, and so on.
6. The components must be uncorrelated (remember the orthogonal directions?). See above.
7. Normalizing data becomes extremely important when the predictors are measured in
different units.
8. PCA works best on data sets having 3 or more dimensions, because with higher dimensions
it becomes increasingly difficult to make interpretations from the resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional data.
End Notes
This brings me to the end of this tutorial. Without delving deep into the mathematics, I've tried
to make you familiar with the most important concepts required to use this technique. It's
simple but needs special attention when deciding the number of components. Practically,
we should strive to retain only the first k components.
The idea behind PCA is to construct a small number of principal components (k ≪ p) which
satisfactorily explain most of the variability in the data, as well as the relationship with the
response variable.
Did you like reading this article? Did you understand this technique? Do share your
suggestions / opinions in the comments section below.
You can test your skills and knowledge. Check out Live Competitions
(http://datahack.analyticsvidhya.com/contest/all) and compete with the
best Data Scientists from all over the world.
Author
Manish Saraswat (http://www.analyticsvidhya.com/blog/author/manish-saraswat/)
I believe education can change this world. Knowledge is the most powerful asset one can
build. It builds up like compound interest. I am passionate about helping people. I care
about animals, unprivileged people, sharing knowledge, health and books. R, Data Science
and Machine Learning keep me busy. Try. Bleed. Succeed.
21 COMMENTS
Excellent Manish
Surobhi says:
Hi Manish,
The information given about PCA in your article was very comprehensive, as you have
covered both the theoretical and the implementation part very well. It was fun and
simple to understand too. Can you please write a similar one for Factor Analysis? How is
it different from PCA, and how does one decide on the method of dimensionality reduction
case by case? Thanks
Thanks Surobhi!
I already have it in my plan to write one detailed post on Factor Analysis soon. Wish me
luck!
This is a good explanation Manish, and thank you for sharing it.
Quick question: a model created using these 30 PCs will draw on all 50 independent variables,
but if I want to figure out which among those 50 independent variables are the most
critical ones, how do we figure that out, so that we can build a model using those specific
variables?
Will appreciate your help.
Thanks
Hello
For model building, we'll use the resultant 30 components as independent variables.
Remember, each component is a vector comprising the principal component loadings
derived from each predictor variable (in this case we have 50). Check
prin_comp$rotation for the loadings in each vector. This technique is
used to shrink the dimension of a data set such that it becomes easier to analyze,
visualize and interpret.
By critical, I assume you are talking about measuring variable importance. If that's the
case, you can look at p values and t statistics in regression. For variable selection,
regression is equipped with various approaches such as forward selection, backward
selection, stepwise selection etc.
Muthupandi says:
Hi Manish
Thanks for the article; it had a good explanation and walkthrough. In PCA on R you have
shown how to get the names of the variables, and it's missing in Python. Can you explain how I
can get the names of the variables after reduction so that we can use them for model
building? Or am I understanding it wrongly?
Hi
I didn't understand what you mean by the name of the variable. Did you mean the rotation
matrix? Please elaborate on that.
Secondly, once you have done PCA, you don't need to use the original variables for modeling;
instead, use the resultant combination of components as independent variables, which
leads to the highest explained variance.
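Using the components as predictors is commonly packaged as principal components regression. A sketch on simulated data (illustrative names and numbers, not the commenter's 50-variable problem): scale, project onto the first few components, then fit a linear model on the scores.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# one latent factor drives 10 correlated predictors and the response
z = rng.normal(size=(200, 1))
X = z @ rng.normal(size=(1, 10)) + 0.3 * rng.normal(size=(200, 10))
y = z.ravel() + 0.1 * rng.normal(size=200)

# principal components regression: scale -> PCA -> linear model on scores
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(round(pcr.score(X, y), 2))   # high R^2, since PC1 captures the factor
```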
Muthupandi says:
Really informative Manish. Also, variables derived from PCA can be used for regression
analysis. Regression analysis with PCA gives a better prediction and less error.
Rightly said. PCA, when used for regression, takes the form of a supervised approach known
as PLS (partial least squares). In PLS, the response variable is used for the identification of
principal components.
Debarshi says:
I have used PCA recently in one of my projects, and would like to add a few points:
- PCA reduces the dimension but the result is not very intuitive, as each PC is a
combination of all the original variables. So use Factor Analysis (factor rotation) on top
of PCA to get a better relationship between the PCs (rather, factors) and the original
variables; this result was brilliant on an insurance data set.
- If you have perfectly correlated variables (A & B), PCA will not suggest you drop one;
rather, it will suggest using a combination of the two (A+B), though of course it will
reduce the dimension.
- This is different from feature selection; don't mix these two concepts.
- There is a concept of nonlinear PCA which helps to include non-numeric values as
well.
- If you want to reduce the dimension (or number) of predictors (X), remember PCA does
not consider the response (Y) while reducing the dimension; your original variables may be
(??) better predictors.
Venu says:
Good One
Pallavi says:
Hi Manish,
Another good article! I have always found it difficult to explain the principal components
to business users. Would really appreciate it if you also write about how you explain PCA
to business users: what general questions you get from business users and how to
handle those.
Thanks
I understand there is a PCA for qualitative data. Could someone provide me with a
good intuitive resource for such?
Krishna says:
Hi Manish,
The article is very helpful. While we normalize the data for numeric variables, do we
need to remove outliers, if any exist in the data, before performing PCA?
Also, it looks like the implementation of the final model in production is quite tedious, as we
always have to compute the components prior to scoring.
Thanks,
Krishna
Hello
As also mentioned in the article, data cleaning (removal of outliers, imputing missing
values) is important prior to implementing principal component analysis. Such things
only add noise and inconsistency to the data. Hence, it is a good practice to sort them
out first.
I beg to differ on this procedure being tedious. For the sake of understanding, I've
explained each and every step used in this technique; maybe this makes it look tedious.
However, if you think you have understood it correctly, just pick the important
commands (code) and get to the scree plot stage in no time.
Ankur says:
Hi Manish,
While running the command > prin_comp <- prcomp(new_my_data, scale. = T),
it gives the error "Error in svd(x, nu = 0) : infinite or missing values in 'x'".
How do I rectify it?
BTW, a GREAT article.
Hello
This error says your data set has missing values. Please impute them.
To rectify, run the code once again where I dealt with missing values. Good luck!
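The fix the reply describes, median imputation, looks like this in a NumPy sketch (toy values, not the actual data):

```python
import numpy as np

# hypothetical numeric column with missing values (toy numbers)
x = np.array([9.3, np.nan, 17.5, 19.2, np.nan, 8.9])

# impute NaNs with the median of the observed values
x[np.isnan(x)] = np.nanmedian(x)

print(np.isnan(x).any())   # False -> prcomp / PCA will no longer fail
```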