
Principal Component Analysis

• A method of extracting the important variables (in the form of components) from a large set of variables available in a data set.
• Brings out strong patterns in large and complex datasets.
• Always performed on a symmetric correlation or covariance matrix, which means the data must be numeric and standardized.
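For intuition, here is a minimal NumPy sketch of that matrix view of PCA: diagonalizing the covariance matrix of standardized data gives the principal directions (eigenvectors) and the variance along each of them (eigenvalues). The toy data and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy data: 100 rows, 3 features

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize (Z-scores)
cov = np.cov(X_std, rowvar=False)              # symmetric covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)         # eigh handles symmetric matrices
order = np.argsort(eigvals)[::-1]              # sort PCs by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_std @ eigvecs                       # projections = PC scores
print(eigvals / eigvals.sum())                 # fraction of variance per PC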

Steps:

1. Data pre-processing (excluding features based on business acumen, treating missing/garbage values, etc.)
2. If any categorical variables are present, convert them to numerical features using one-hot encoding
(Refer: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ )
3. Divide the data into train and test datasets (80-20)
4. Scale and center the training data: normalize based on the mean and standard deviation, i.e., calculate Z-scores (a code sketch follows the note below)

Why is Normalization required?

• PCA is fed the normalized versions of the original predictors because the original predictors may be on very different scales. For example, imagine a data set whose variables are measured in gallons, kilometres, light years, etc.; the variances of these variables are bound to differ enormously.
• Performing PCA on un-normalized variables produces very large loadings for the variables with high variance, so the leading principal components end up depending mostly on those variables. This is undesirable.
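A minimal sketch of steps 3-4 with scikit-learn; X and y are assumed placeholders for the pre-processed features and target from steps 1-2.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)       # 80-20 split (step 3)

scaler = StandardScaler().fit(X_train)          # mean and std from TRAIN data only
X_train_std = scaler.transform(X_train)         # Z-scores for the training data
X_test_std = scaler.transform(X_test)           # same mean/std reused later (step 8)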
5. Principal Components:
a. Principal components (PCs) are drawn such that the perpendicular projections of the original data points onto the PC are as spread out as possible, while the perpendicular distance of each projection from its original data point is as small as possible
b. In simple words, the proxy of each data point on the PC stays as close to the original data point as possible, while all the proxies on the PC are as distant from each other as possible
c. Once the first PC, which explains the maximum variance, is drawn, the second PC is drawn perpendicular to it so that it explains the next largest amount of variance
E.g., if PC1 explains 85% of the variance, PC2 might explain 10% of the variance left unexplained by PC1, PC3 perhaps 2%, and so on.
d. If the majority of the variance (usually >95%) is explained by 2 or 3 PCs, then PCA makes sense; use the scree plot (a.k.a. "elbow curve") to decide how many PCs to keep for your model (see the sketch after this list)
e. After drawing the PCs, we can rotate them to look like the usual X-Y axes for simplicity
f. The transformation matrix does all of this for you in a single line of code
6. Once PCA is done, you can see the inherent clusters in the data in the PCA score plot
7. The transformed data can now be used to build a model with the 2/3 PCs as the new "features"
8. The test data used for validation must be normalized with the same mean and standard deviation computed in step 4, and then transformed with the same PCs
9. After normalizing and transforming the test data, use it to evaluate the model
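A sketch of steps 5-9, continuing from the scaling snippet above (X_train_std, X_test_std, y_train, y_test). The choice of k = 2 components and of a linear regression model are illustrative assumptions, not prescribed by these notes.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

pca = PCA().fit(X_train_std)                    # PCs are fit on training data only

# Scree plot ("elbow curve"): explained variance per PC (step 5d)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()

k = 2                                           # chosen from the scree plot (assumption)
Z_train = pca.transform(X_train_std)[:, :k]     # PC scores = new features (step 7)
Z_test = pca.transform(X_test_std)[:, :k]       # same rotation applied to test data (step 8)

model = LinearRegression().fit(Z_train, y_train)    # illustrative model choice
print("Test R^2:", model.score(Z_test, y_test))     # evaluation on test data (step 9)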

Reference: https://blog.bioturing.com/2018/06/18/how-to-read-pca-biplots-and-scree-plots/

Find the code here: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

Homework:

Cereals dataset >> exclude ‘shelf’ and ‘cereal name’ >> ‘rating’ is ‘y’ >> do a train-test split >> using the train data, create PCs >> link them to the outcome ‘y’ >> use this model to predict values for the test dataset (a hedged sketch follows)
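A possible starting point for the homework. The file name cereals.csv and the exact column labels ('shelf', 'cereal name', 'rating') are assumptions that may need adjusting to the actual dataset; linear regression is likewise an illustrative model choice.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cereals.csv")                       # assumed file name
df = df.drop(columns=["shelf", "cereal name"])        # exclude per the instructions
X, y = df.drop(columns=["rating"]), df["rating"]      # assumed target column label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)                   # train statistics only
pca = PCA(n_components=3).fit(scaler.transform(X_tr)) # PCs from the train data

Z_tr = pca.transform(scaler.transform(X_tr))          # train features in PC space
Z_te = pca.transform(scaler.transform(X_te))          # test data: same scaler and PCs

model = LinearRegression().fit(Z_tr, y_tr)            # link PCs to the outcome y
print(model.predict(Z_te))                            # predictions for the test dataset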
