http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/
23/03/2016

Introduction
Too much of anything is good for nothing!
What happens when a data set has too many variables? Here are a few possible situations
which you might come across:
1. You find that most of the variables are correlated.
2. You lose patience and decide to run a model on the whole data. This returns poor accuracy
and you feel terrible.
3. You become indecisive about what to do.
4. You start thinking of some strategic method to find a few important variables.
Trust me, dealing with such situations isn't as difficult as it sounds. Statistical techniques
such as factor analysis and principal component analysis help to overcome such difficulties.
In this post, I've explained the concept of principal component analysis in detail. I've kept
the explanation simple and informative. For practical understanding, I've also
demonstrated using this technique in R with interpretations.
The first principal component can be written as

Z₁ = φ₁₁X₁ + φ₂₁X₂ + … + φₚ₁Xₚ

where,
Z₁ is the first principal component
φ₁ is the loading vector comprising of the loadings (φ₁₁, φ₂₁, …, φₚ₁) of the first principal
component. The loadings are constrained so that their sum of squares is equal to one.
Therefore,
The first principal component is a linear combination of the original predictor variables which
captures the maximum variance in the data set. It determines the direction of highest
variability in the data. The larger the variability captured by the first component, the larger
the information captured by that component. No other component can have variability higher than
the first principal component.
The first principal component results in a line which is closest to the data, i.e. it minimizes the
sum of squared distances between the data points and the line.
Similarly, we can compute the second principal component also.
If the two components are uncorrelated, their directions should be orthogonal (image
below). This image is based on simulated data with 2 predictors. Notice the direction of the
components; as expected, they are orthogonal. This suggests the correlation between these
components is zero.
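This can be checked numerically. Here is a minimal NumPy sketch on simulated data with 2 predictors (made-up numbers, not the article's data set): the principal directions are eigenvectors of the covariance matrix, so they come out orthogonal and the component scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated data set with 2 correlated predictors (illustrative only)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)   # center, as PCA requires

# principal directions are eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending order
pc1, pc2 = eigvecs[:, -1], eigvecs[:, -2]

# the directions are orthogonal ...
print(np.dot(pc1, pc2))
# ... so the component scores are uncorrelated
scores = X @ eigvecs[:, ::-1]
print(np.corrcoef(scores, rowvar=False)[0, 1])
```

Both printed values are zero up to floating-point noise, which is exactly what the image conveys.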
All succeeding principal components follow a similar concept, i.e. they capture the
remaining variation without being correlated with the previous components. In general, for
an n × p data set, min(n − 1, p) distinct principal components can be constructed.
Note: Partial least squares (PLS) is a supervised alternative to PCA. PLS assigns higher weight
to variables which are strongly related to the response variable when determining the principal
components.
Performing PCA on un-normalized variables will lead to insanely large loadings for variables
with high variance. In turn, this will lead to dependence of a principal component on the
variable with high variance. This is undesirable.
As shown in the image below, PCA was run on a data set twice (with unscaled and scaled
predictors). This data set has ~40 variables. You can see that the first principal component is
dominated by the variable Item_MRP, and the second principal component is dominated by the
variable Item_Weight. This domination prevails due to the high variance associated
with these variables. When the variables are scaled, we get a much better representation of the
variables in 2D space.
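The effect is easy to reproduce on simulated data. In this sketch the two columns are hypothetical stand-ins for Item_MRP and Item_Weight (the numbers are made up): without scaling, the first loading vector is dominated by the high-variance column; after scaling, both variables contribute comparably.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
# hypothetical stand-ins: a high-variance and a low-variance predictor
mrp = rng.normal(140, 60, 1000)
weight = 12 + 0.02 * mrp + rng.normal(0, 3, 1000)   # mildly correlated
X = np.column_stack([mrp, weight])

# unscaled: the PC1 loading vector is dominated by the high-variance column
pc1_raw = PCA().fit(X).components_[0]
# scaled: both variables contribute comparably
pc1_std = PCA().fit(scale(X)).components_[0]

print(np.abs(pc1_raw))   # first entry close to 1, second close to 0
print(np.abs(pc1_std))   # entries of similar magnitude
```

This is the same behavior the two biplots in the image illustrate on the ~40-variable data set.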
Remember, PCA can be applied only on numerical data. Therefore, if the data has
categorical variables, they must be converted to numerical. Also, make sure you have done
the basic data cleaning prior to implementing this technique. Let's quickly finish the initial
data loading and cleaning steps:
# directory path
> path <- ".../Data/Big_Mart_Sales"

# set working directory
> setwd(path)

# load train and test file
> train <- read.csv("train_Big.csv")
> test <- read.csv("test_Big.csv")

# add a column
> test$Item_Outlet_Sales <- 1

# combine the data set
> combi <- rbind(train, test)

# impute missing values with median
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

# impute 0 with median
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility), combi$Item_Visibility)

# find mode and impute
> table(combi$Outlet_Size, combi$Outlet_Type)
> levels(combi$Outlet_Size)[1] <- "Other"
Till here, we've imputed missing values. Now we are left with removing the dependent
(response) variable and other identifier variables (if any). As we said above, we are practicing
an unsupervised learning technique, hence the response variable must be removed.
# remove the dependent and identifier variables
> my_data <- subset(combi, select = -c(Item_Outlet_Sales, Item_Identifier, Outlet_Identifier))
Let's check the available variables (a.k.a. predictors) in the data set.

# check available variables
> colnames(my_data)

Since PCA works on numeric variables, let's see if we have any variable other than numeric.

# check variable class
> str(my_data)
'data.frame': 14204 obs. of 9 variables:
 $ Item_Weight              : num 9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
 $ Item_Visibility          : num 0.016 0.0193 0.0168 0.054 0.054 ...
 $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
 $ Item_MRP                 : num 249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ..
 $ Outlet_Size              : Factor w/ 4 levels "Other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
 $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
 $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do
now. We'll convert these categorical variables into numeric using one hot encoding.

# load library
> library(dummies)

# create a dummy data frame
> new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content", "Item_Type",
                  "Outlet_Establishment_Year", "Outlet_Size",
                  "Outlet_Location_Type", "Outlet_Type"))
# check the data set
> str(new_my_data)

And, we now have all the numerical values. We can now go ahead with PCA.

The base R function prcomp() is used to perform PCA. By default, it centers the variables to
have mean equal to zero. With the parameter scale. = T, we normalize the variables to have
standard deviation equal to 1.

# principal component analysis
> prin_comp <- prcomp(new_my_data, scale. = T)
> names(prin_comp)
[1] "sdev"     "rotation" "center"   "scale"    "x"
2. The rotation measure provides the principal component loading. Each column of rotation
matrix contains the principal component loading vector. This is the most important measure
we should be interested in.
> prin_comp$rotation

This returns 44 principal component loadings. Is that correct? Absolutely. In a data set, the
maximum number of principal component loading vectors is min(n − 1, p). Let's look at the
first 4 principal components and first 5 rows.
> prin_comp$rotation[1:5, 1:4]
                                 PC1         PC2         PC3         PC4
Item_Weight             0.0054429225 0.001285666 0.011246194 0.011887106
Item_Fat_ContentLF      0.0021983314 0.003768557 0.009790094 0.016789483
Item_Fat_Contentlow fat 0.0019042710 0.001866905 0.003066415 0.018396143
Item_Fat_ContentLow Fat 0.0027936467 0.002234328 0.028309811 0.056822747
Item_Fat_Contentreg     0.0002936319 0.001120931 0.009033254 0.001026615
3. In order to compute the principal component score vectors, we don't need to multiply the
loadings with the data. Rather, the matrix x already contains the principal component score
vectors, in 14204 × 44 dimension.
> dim(prin_comp$x)
[1] 14204 44
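The analogue in scikit-learn is the output of transform(). A small sketch on simulated stand-in data (not the Big Mart set): the score matrix is exactly the centered data multiplied by the loading vectors, which is what the statement above means.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))        # simulated stand-in data

pca = PCA().fit(X)
scores = pca.transform(X)            # analogue of prin_comp$x, here 100 x 5

# the scores are the centered data times the loading vectors
manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(scores, manual))   # True
```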
Let's plot the resultant principal components.

> biplot(prin_comp, scale = 0)
The parameter scale = 0 ensures that the arrows are scaled to represent the loadings. To
make inferences from the image above, focus on the extreme ends (top, bottom, left, right) of
this graph.
We infer that the first principal component corresponds to a measure of one set of variables,
while the second component corresponds to a measure of another set, including
Outlet_Location_TypeTier1.
We aim to find the components which explain the maximum variance. This is because we
want to retain as much information as possible using these components. So, the higher the
explained variance, the higher the information contained in those components.
To compute the proportion of variance explained by each component, we simply divide each
component's variance (pr_var, the square of prin_comp$sdev) by the sum of total variance.
This results in:
# proportion of variance explained
> prop_varex <- pr_var / sum(pr_var)
> prop_varex[1:20]
 [1] 0.10371853 0.07312958 0.06238014 0.05775207 0.04995800 0.04580274
 [7] 0.04391081 0.02856433 0.02735888 0.02654774 0.02559876 0.02556797
[13] 0.02549516 0.02508831 0.02493932 0.02490938 0.02468313 0.02446016
[19] 0.02390367 0.02371118
This shows that the first principal component explains 10.3% of the variance, the second
component explains 7.3%, the third 6.2%, and so on. So, how do we decide how many
components to select for the modeling stage?
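Before answering, it is worth confirming the arithmetic. A NumPy sketch on simulated data (illustrative variances, not the Big Mart set): dividing each component's variance by the total reproduces scikit-learn's explained_variance_ratio_, and the proportions sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# simulated predictors with very different variances (illustrative only)
X = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA().fit(X)
pr_var = pca.explained_variance_           # variance of each component
prop_varex = pr_var / pr_var.sum()         # same formula as the R code above

print(np.allclose(prop_varex, pca.explained_variance_ratio_))  # True
print(round(float(prop_varex.sum()), 10))                      # 1.0
```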
The answer to this question is provided by a scree plot. A scree plot is used to assess the
components or factors which explain most of the variability in the data. It represents the
values in descending order.

# scree plot
> plot(prop_varex, xlab = "Principal Component",
       ylab = "Proportion of Variance Explained",
       type = "b")
The plot above shows that ~30 components explain around 98.4% of the variance in the data set.
In other words, using PCA we have reduced 44 predictors to 30 without compromising on
explained variance. This is the power of PCA. Let's do a confirmation check by plotting a
cumulative variance plot. This will give us a clear picture of the number of components.

# cumulative scree plot
> plot(cumsum(prop_varex), xlab = "Principal Component",
       ylab = "Cumulative Proportion of Variance Explained",
       type = "b")
This plot shows that 30 components result in cumulative variance close to ~98%. Therefore, in
this case, we'll select the number of components as 30 [PC1 to PC30] and proceed to the modeling
stage. This completes the steps to implement PCA in R. For modeling, we'll use these 30
components as predictor variables and follow the normal procedures.
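The selection rule used above, i.e. keep the smallest number of components whose cumulative explained variance crosses a threshold, can be sketched as follows on simulated data (the 98% threshold mirrors the text; everything else is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 20 simulated predictors with decreasing variances (illustrative only)
X = rng.normal(size=(300, 20)) * np.linspace(10, 0.1, 20)

ratio = PCA().fit(X).explained_variance_ratio_
cum = np.cumsum(ratio)

# smallest k whose cumulative explained variance reaches 98%
k = int(np.searchsorted(cum, 0.98) + 1)
print(k, cum[k - 1])
```

The same one-liner applied to the 44-component fit above is what yields k = 30 in the article.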
For Python users: To implement PCA in Python, simply import PCA from the sklearn library. The
interpretation remains the same as explained for R users above. Of course, the result is the same
as derived using R. The data set used for Python is a cleaned version where missing values
have been imputed and categorical variables are converted into numeric.
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline
# Load data set
data = pd.read_csv('Big_Mart_PCA.csv')

# convert it to numpy arrays
X = data.values

# Scaling the values
X = scale(X)

pca = PCA(n_components=44)
pca.fit(X)

# The amount of variance that each PC explains
var = pca.explained_variance_ratio_

# Cumulative variance explained
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4) * 100)
print(var1)
[  10.37   17.68   23.92   29.7    34.7    39.28   43.67   46.53   49.27
   51.92   54.48   57.04   59.59   62.1    64.59   67.08   69.55   72.
   74.39   76.76   79.1    81.44   83.77   86.06   88.33   90.59   92.7
   94.76   96.78   98.44  100.01  100.01  100.01  100.01  100.01  100.01
  100.01  100.01  100.01  100.01  100.01  100.01  100.01  100.01]
plt.plot(var1)
# Looking at above plot I'm taking 30 variables
pca = PCA(n_components=30)
pca.fit(X)
X1 = pca.fit_transform(X)
print(X1)
For more information on PCA in Python, visit the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
Points to Remember
1. PCA is used to extract important features from a data set.
2. These features are low dimensional in nature.
3. These features, a.k.a. components, result from normalized linear combinations of the original
predictor variables.
4. These components aim to capture as much information as possible with high explained
variance.
5. The first component has the highest variance, followed by the second, third, and so on.
6. The components must be uncorrelated (remember the orthogonal directions?). See above.
7. Normalizing data becomes extremely important when the predictors are measured in
different units.
8. PCA works best on data sets having 3 or more dimensions, because with higher dimensions
it becomes increasingly difficult to make interpretations from the resultant cloud of data.
9. PCA is applied on a data set with numeric variables.
10. PCA is a tool which helps to produce better visualizations of high dimensional data.
End Notes
This brings me to the end of this tutorial. Without delving deep into the mathematics, I've tried
to make you familiar with the most important concepts required to use this technique. It's
simple but needs special attention when deciding the number of components. Practically,
we should strive to retain only the first k components.
The idea behind PCA is to construct a small number of principal components (k ≪ p) which
satisfactorily explain most of the variability in the data, as well as the relationship with the
response variable.
Did you like reading this article? Did you understand this technique? Do share your
suggestions / opinions in the comments section below.
You can test your skills and knowledge. Check out Live Competitions
(http://datahack.analyticsvidhya.com/contest/all) and compete with the
best Data Scientists from all over the world.
Author
Manish Saraswat (http://www.analyticsvidhya.com/blog/author/manish-saraswat/)
I believe education can change this world. Knowledge is the most powerful asset one can
build. It builds up like compound interest. I am passionate about helping people. I care
about animals, unprivileged people, sharing knowledge, health and books. R, Data Science
and Machine Learning keep me busy. Try. Bleed. Succeed.
21 COMMENTS
Excellent Manish
Surobhi says:
Hi Manish,
The information given about PCA in your article was very comprehensive, as you have
covered both the theoretical and the implementation part very well. It was fun and
simple to understand too. Can you please write a similar one for Factor Analysis? How is
it different from PCA, and how does one decide on the method of dimensionality reduction
case by case? Thanks
Thanks Surobhi!
I already have it in my plan to write one detailed post on Factor Analysis soon. Wish me
luck!
This is a good explanation Manish, and thank you for sharing it.
Quick question: a model created using these 30 PCs will draw on all 50 independent variables,
but if I want to figure out which among those 50 independent variables are the most
critical ones, how do we figure that out, so that we can build a model using those specific
variables?
Will appreciate your help.
Thanks
Hello
For model building, we'll use the resultant 30 components as independent variables.
Remember, each component is a vector comprising the principal component loadings
derived from each predictor variable (in this case we have 50). Check
prin_comp$rotation for the loadings in each vector. This technique is
used to shrink the dimension of a data set such that it becomes easier to analyze,
visualize and interpret.
By critical, I assume you are talking about measuring variable importance. If that's the
case, you can look at p values and t statistics in regression. For variable selection,
regression is equipped with various approaches such as forward selection, backward
selection, stepwise selection etc.
Muthupandi says:
Hi Manish
Thanks for the article; it had a good explanation and walkthrough. In PCA on R you have
shown how to get the names of the variables, and it's missing in Python. Can you explain how I
can get the names of the variables after reduction so that we can use them for model
building? Or am I understanding it wrongly?
Hi
I didn't understand what you mean by the name of the variable. Did you mean the rotation
matrix? Please elaborate on that.
Secondly, once you have done PCA, you don't need to use the original variables for modeling;
instead, use the resultant combination of components as independent variables, which
leads to the highest explained variance.
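Using the components as predictors is commonly packaged as principal components regression. A sketch on simulated data (illustrative names and numbers, not the commenter's 50-variable problem): scale, project onto the first few components, then fit a linear model on the scores.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# one latent factor drives 10 correlated predictors and the response
z = rng.normal(size=(200, 1))
X = z @ rng.normal(size=(1, 10)) + 0.3 * rng.normal(size=(200, 10))
y = z.ravel() + 0.1 * rng.normal(size=200)

# principal components regression: scale -> PCA -> linear model on scores
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(round(pcr.score(X, y), 2))   # high R^2, since PC1 captures the factor
```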
Muthupandi says:
Really informative Manish. Also, variables derived from PCA can be used for regression
analysis. Regression analysis with PCA gives a better prediction and less error.
Rightly said. PCA, when used for regression, takes the form of a supervised approach known
as PLS (partial least squares). In PLS, the response variable is used for the identification of
principal components.
Debarshi says:
I have used PCA recently in one of my projects, and would like to add a few points:
- PCA reduces the dimension but the result is not very intuitive, as each PC is a
combination of all the original variables. So use Factor Analysis (factor rotation) on top
of PCA to get a better relationship between the PCs (rather, factors) and the original
variables; this result was brilliant on an insurance data set.
- If you have perfectly correlated variables (A & B), PCA will not suggest you drop one;
rather, it will suggest using a combination of the two (A+B), though of course it will
reduce the dimension.
- This is different from feature selection; don't mix these two concepts.
- There is a concept of nonlinear PCA which helps to include non-numeric values as
well.
- If you want to reduce the dimension (or number) of predictors (X), remember PCA does
not consider the response (Y) while reducing the dimension; your original variables may be
(??) better predictors.
Venu says:
Good One
Pallavi says:
Hi Manish,
Another good article! I have always found it difficult to explain the principal components
to business users. Would really appreciate it if you also write about how you explain PCA
to business users: what general questions you get from business users and how to
handle those.
Thanks
I understand there is a PCA for qualitative data. Could someone provide me with a
good intuitive resource for such?
Krishna says:
Hi Manish,
The article is very helpful. While we normalize the data for numeric variables, do we
need to remove outliers, if any exist in the data, before performing PCA?
Also, it looks like the implementation of the final model in production is quite tedious, as we
always have to compute the components prior to scoring.
Thanks,
Krishna
Hello
As also mentioned in the article, data cleaning (removal of outliers, imputing missing
values) is important prior to implementing principal component analysis. Such things
only add noise and inconsistency to the data. Hence, it is a good practice to sort them
out first.
I beg to differ on this procedure being tedious. For the sake of understanding, I've
explained each and every step used in this technique; maybe this makes it look tedious.
However, if you think you have understood it correctly, just pick the important
commands (code) and get to the scree plot stage in no time.
Ankur says:
Hi Manish,
While running the command > prin_comp <- prcomp(new_my_data, scale. = T),
it gives the error "Error in svd(x, nu = 0) : infinite or missing values in 'x'".
How do I rectify it?
BTW, a GREAT article.
Hello
This error says your data set has missing values. Please impute them.
To rectify, run the code once again where I dealt with missing values. Good luck!
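The fix the reply describes, median imputation, looks like this in a NumPy sketch (toy values, not the actual data):

```python
import numpy as np

# hypothetical numeric column with missing values (toy numbers)
x = np.array([9.3, np.nan, 17.5, 19.2, np.nan, 8.9])

# impute NaNs with the median of the observed values
x[np.isnan(x)] = np.nanmedian(x)

print(np.isnan(x).any())   # False -> prcomp / PCA will no longer fail
```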