Pratik Zanke
Contents
1 Project Objective
2 Assumptions
5 Conclusion
1 Project Objective
The objective of this report is to analyze the Factor Hair dataset in R and derive insights from the data. The analysis report comprises the following:
• Importing the dataset in R
• Understanding the structure of the dataset
• Exploratory analysis of the dataset
• Principal component and factor analysis of the dataset
• Simple and multiple linear regression
• Insights from the analysis
2 Assumptions
The analysis and inferences in this case study are based on the information given in the dataset ("Factor hair"). The assumptions are as follows:
• Satisfaction is the dependent variable, and ProdQual, Ecom, TechSup, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, WartyClaim, OrdBilling and DelSpeed are the independent variables.
3.1.1 Packages Used
• readr: provides a fast and friendly way to read rectangular data (.csv, .tsv, etc.) and to flexibly parse many types of data
• dplyr: a package which makes data manipulation easier
• psych: used for psychological, psychometric and personality research
• corrplot: used for creating and visualizing a correlation matrix
• ggplot2: used for creating elegant data visualizations, mostly graphs
• nFactors: provides indices and strategies to determine the number of factors/components to retain
• lattice: an implementation of Trellis graphics which provides a high-level data visualization system with an emphasis on multivariate data
• tidyr: tidies the data to make it easy to visualize and model
• tidyverse: an opinionated collection of R packages used for data science
• car: acts as a companion for applied regression and is mainly used for ANOVA tests
The command used to install the packages is install.packages(), as shown below.
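A minimal sketch covering all the packages listed in section 3.1.1 (assuming none of them are installed yet):

install.packages(c("readr", "dplyr", "psych", "corrplot", "ggplot2",
                   "nFactors", "lattice", "tidyr", "tidyverse", "car"))   # run once per machine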
3.2.1 Variable Identification – Inferences
• install.packages(): the install.packages() command has been used to install all the packages mentioned in section 3.1.1
• library(): the library() function has been used to load the installed packages
• setwd(): sets the working directory from which the data files are imported. In the given case study, the command used is
Code: setwd("F:/project")
• getwd(): verifies the working directory which contains the data file to be imported
• Dimensions: checking the dimensionality, that is, the number of rows and columns of the dataset. The code syntax is:
dim(<fact>)
• Class: the next step is to check the type of the entire dataset. The syntax used here is:
class(<fact>)
• Column headings: checking the names of the headers, or column headings. The syntax used here is:
names(<fact>)
• Missing values: checking for missing values in the dataset. Missing values refer to "NA" or blanks in the dataset. The syntax for the code is:
any(is.na(<fact>))
• Structure of the dataset: the structure refers to the type of each column of the dataset. There are 13 columns in the dataset. The syntax for the code is:
str(<fact>)
The output tells us that the ID column is of "integer" type and the remaining columns are of "num" type.
• Summary and describe: summary() gives the minimum, maximum, quartiles and mean of each variable, and the describe() command gives the mean, median, standard deviation, min-max range and kurtosis of the dataset. The syntax used here is:
summary(<fact>)
describe(<fact>)
• Univariate graphs: univariate graphs show the distribution of a single variable. Three univariate graphs are used for the dataset: histogram, density plot and box plot. In the given scenario, the graphs are plotted for the dependent variable, that is, Satisfaction. The syntax used here is:
histogram(<variablename>)
boxplot(<variablename>)
densityplot(<variablename>)
The graph outputs are given below:
• Bivariate graphs: bivariate graphs show the spread of the variables of the dataset against one another. The graphs used here are the histogram, boxplot and scatterplot. The syntax used here is:
multi.hist(<datasetname>)
boxplot(<datasetname>)
Histogram
Boxplot
Scatterplot
• Outliers: outliers are values which do not fit within the standard range of the data. Based on the boxplots, there are outliers in 4 of the independent variables. The syntax used here is:
boxplot(<dataset>$<independentvar>, plot = FALSE)$out
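For example, a minimal sketch for a single variable (Ecom is used here purely as an illustration; the report does not name the four variables that contain outliers):

boxplot(fact$Ecom, plot = FALSE)$out   # values flagged as outliers by the boxplot rule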
3.4 Multicollinearity
Multicollinearity refers to correlation among the independent variables of the dataset. In a regression model, multicollinearity increases the standard errors of the coefficients. This can make some independent variables appear statistically insignificant, leading to a distorted regression model.
In the given dataset, we have 1 dependent variable (Satisfaction) and the rest 11 are independent
variables (excluding the ID).
Correlation Matrix: For finding the correlation coefficients and matrix, the below commands are used:
correlation = cor(fact1)
corrplot(<correlationmatrix>, method = "number")
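A minimal sketch, assuming fact1 holds the imported dataset without the ID column (the report does not show how fact1 was created):

fact1 <- fact[ , -1]                        # drop the ID column (assumed definition of fact1)
correlation <- cor(fact1)                   # correlation matrix of the remaining variables
corrplot(correlation, method = "number")    # plot the correlation coefficients as numbers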
The cor command is used to determine the correlation matrix and the corrplot is used to plot the
correlation coefficients. Below is the corrplot:
The above plot gives us the correlation between the variables in the given dataset and the conclusions are
as follows:
1. CompRes and DelSpeed are highly correlated (0.87)
2. OrdBilling and CompRes are highly correlated (0.76)
3. WartyClaim and TechSup are highly correlated (0.80)
4. OrdBilling and DelSpeed are highly correlated (0.75)
5. Ecom and SalesFImage are highly correlated (0.79)
Variance Inflation Factor: The second method is the variance inflation factor (VIF). VIF measures the amount of multicollinearity in a set of multiple regression variables. It tells how much the behaviour (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables.
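A minimal sketch, assuming the car package is loaded and fact1 is the dataset without the ID column:

model_all <- lm(Satisfaction ~ ., data = fact1)   # Satisfaction regressed on all 11 independent variables
vif(model_all)                                    # variance inflation factor of each predictor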
where lm() fits the linear regression between the dependent variable and the independent variables, and vif() computes the inflation factor of each predictor.
The VIF values are greater than 1 in all cases, which indicates that multicollinearity exists in the dataset. In particular, the VIF for DelSpeed is 6.51 and for CompRes is 4.73, which shows high multicollinearity for these variables.
3.5 Simple Linear Regression
Simple linear regression (SLR) is used to determine the relationship between the dependent variable and a single independent variable. In the "Factor hair" dataset there are 11 independent variables, and Satisfaction is the dependent variable. SLR is represented by the equation:
y = β0 + β1x1
where y is the dependent variable, x1 is the independent variable, and β0 and β1 are the intercept and slope coefficients respectively.
The loess.smooth() function is used to add a smooth curve to the scatterplot, and the col argument is used to colour the regression line.
The linear regression is done using the lm() function against the imported dataset “fact”.
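A minimal sketch for one such model (ProdQual is shown; the same pattern applies to the other independent variables):

slr_prodqual <- lm(Satisfaction ~ ProdQual, data = fact)            # simple linear regression
summary(slr_prodqual)                                               # coefficients, R2, F-statistic
plot(fact$ProdQual, fact$Satisfaction,
     xlab = "ProdQual", ylab = "Satisfaction")                      # scatterplot
lines(loess.smooth(fact$ProdQual, fact$Satisfaction), col = "red")  # smoothed trend line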
The linear regression has been performed on all the independent variables and the regression graphs have
been plotted (Code in the appendix). The linear model and the graph output are given below for all the
11 variables:
Linear model of Satisfaction and ProdQual

            Estimate   Std. Error   t value   Pr(>|t|)
Intercept   3.67593    0.59765      6.151     1.68e-08
ProdQual    0.41512    0.07534      5.510     2.90e-07

R2 = 0.2365; Adjusted R2 = 0.2287; F-statistic = 30.36 on 1 and 98 DF; p-value = 2.90e-07
Residual standard error = 1.047 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 3.6759 + 0.4151 * ProdQual
Intercept (β0) = 3.6759
Slope (β1) = 0.4151

Interpretation: For any one unit change in product quality, the Satisfaction rating would improve by 0.4151, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and TechSup

            Estimate   Std. Error   t value   Pr(>|t|)
Intercept   6.44757    0.43592      14.791    < 2e-16
TechSup     0.08768    0.07817      1.122     2.65e-01

R2 = 0.01268; Adjusted R2 = 0.002603; F-statistic = 1.258 on 1 and 98 DF; p-value = 0.2647
Residual standard error = 1.19 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 6.4475 + 0.08768 * TechSup
Intercept (β0) = 6.4475
Slope (β1) = 0.08768

Interpretation: For any one unit change in TechSup, the Satisfaction rating would improve by 0.08768, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and CompRes

            Estimate   Std. Error   t value   Pr(>|t|)
Intercept   3.68005    0.44285      8.310     5.51e-13
CompRes     0.59499    0.07946      7.488     3.09e-11

R2 = 0.3639; Adjusted R2 = 0.3574; F-statistic = 56.07 on 1 and 98 DF; p-value = 3.08e-11
Residual standard error = 0.9554 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 3.68005 + 0.59499 * CompRes
Intercept (β0) = 3.68005
Slope (β1) = 0.59499

Interpretation: For any one unit change in CompRes, the Satisfaction rating would improve by 0.59499, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and Advertising

              Estimate   Std. Error   t value   Pr(>|t|)
Intercept     5.6259     0.4237       13.279    < 2e-16
Advertising   0.3222     0.1018       3.167     2.06e-03

R2 = 0.09282; Adjusted R2 = 0.08357; F-statistic = 10.03 on 1 and 98 DF; p-value = 0.002056
Residual standard error = 1.141 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 5.6259 + 0.3222 * Advertising
Intercept (β0) = 5.6259
Slope (β1) = 0.3222

Interpretation: For any one unit change in Advertising, the Satisfaction rating would improve by 0.3222, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and ProdLine

            Estimate   Std. Error   t value   Pr(>|t|)
Intercept   4.02203    0.45471      8.845     3.87e-14
ProdLine    0.49887    0.07641      6.529     2.95e-09

R2 = 0.3031; Adjusted R2 = 0.296; F-statistic = 42.62 on 1 and 98 DF; p-value = 2.95e-09
Residual standard error = 1 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 4.02203 + 0.49887 * ProdLine
Intercept (β0) = 4.02203
Slope (β1) = 0.49887

Interpretation: For any one unit change in ProdLine, the Satisfaction rating would improve by 0.49887, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and SalesFImage

              Estimate   Std. Error   t value   Pr(>|t|)
Intercept     4.06983    0.50874      8.000     2.54e-12
SalesFImage   0.55596    0.09722      5.719     1.16e-07

R2 = 0.2502; Adjusted R2 = 0.2426; F-statistic = 32.7 on 1 and 98 DF; p-value = 1.16e-07
Residual standard error = 1.037 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 4.06983 + 0.55596 * SalesFImage
Intercept (β0) = 4.06983
Slope (β1) = 0.55596

Interpretation: For any one unit change in SalesFImage, the Satisfaction rating would improve by 0.55596, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and WartyClaim

             Estimate   Std. Error   t value   Pr(>|t|)
Intercept    5.3581     0.8813       6.079     2.32e-08
WartyClaim   0.2581     0.1445       1.786     7.72e-02

R2 = 0.03152; Adjusted R2 = 0.02164; F-statistic = 3.19 on 1 and 98 DF; p-value = 0.0772
Residual standard error = 1.179 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 5.3581 + 0.2581 * WartyClaim
Intercept (β0) = 5.3581
Slope (β1) = 0.2581

Interpretation: For any one unit change in WartyClaim, the Satisfaction rating would improve by 0.2581, keeping other things constant, as explained by the model.
Linear Model of Satisfaction and OrdBilling

             Estimate   Std. Error   t value   Pr(>|t|)
Intercept    4.0541     0.484        8.377     3.96e-13
OrdBilling   0.6695     0.1106       6.054     2.60e-08

R2 = 0.2722; Adjusted R2 = 0.2648; F-statistic = 36.65 on 1 and 98 DF; p-value = 2.60e-08
Residual standard error = 1.022 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 4.0541 + 0.6695 * OrdBilling
Intercept (β0) = 4.0541
Slope (β1) = 0.6695

Interpretation: For any one unit change in OrdBilling, the Satisfaction rating would improve by 0.6695, keeping other things constant, as explained by the model.
3.6 Principal Component Analysis and Factor Analysis
The next part of the analysis is to perform Principal Component Analysis (PCA) and Factor Analysis (FA) on the given dataset.
Factor Analysis is quite similar to Principal Component Analysis: it also involves shrinking the dataset into a smaller, more manageable set of dimensions.
The main difference between PCA and FA is that in PCA the components are linear combinations of the original variables, whereas in FA the original variables are expressed as combinations of underlying factors. PCA reduces the data into a smaller number of components, while factor analysis describes the constructs that underlie the data.
To check whether the dataset is suitable for PCA/FA, we use the below tests:
a) Bartlett’s test of Sphericity
b) Kaiser-Meyer Olkin test (KMO)
Bartlett's test of Sphericity: In this test, the correlation matrix is compared against an identity matrix; an identity matrix would indicate that the variables are unrelated and therefore not suitable for structure detection. A p-value less than the significance level, alpha (0.05), indicates that a factor analysis will be useful for the data.
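A minimal sketch, assuming the psych package is loaded and cor_indep is the correlation matrix of the 11 independent variables (the column range 2:12 is an assumed file layout, with ID first and Satisfaction last; n = 100 is the number of observations):

cor_indep <- cor(fact[ , 2:12])         # correlation matrix of the 11 independent variables
cortest.bartlett(cor_indep, n = 100)    # Bartlett's test of sphericity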
The output is:
Chi-Square   p-value     DF
619.2726     1.793e-96   55
The p-value of the test is 1.79e-96, which is less than the level of significance (alpha = 0.05); this indicates that factor analysis is suitable for the dataset.
Kaiser-Meyer Olkin test (KMO): The Kaiser-Meyer-Olkin Measure of Sampling Adequacy is a statistic that
indicates the proportion of variance in your variables that might be caused by underlying factors. If the
value is less than 0.50, the results of the factor analysis probably won't be very useful.
The test has been conducted on the “fact” dataset. The syntax used is:
KMO (<correlation matrix of the dataset>)
The overall KMO measure for the dataset is 0.65, which is greater than 0.5. Hence, PCA/FA is considered appropriate.
The first step towards dimensionality reduction is the calculation of eigenvectors and eigenvalues, which are at the core of PCA and FA. The eigenvectors determine the directions of the new feature space and the eigenvalues determine the magnitude (variance) along those directions. Eigenvectors and eigenvalues are calculated on the correlation matrix.
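A minimal sketch, using base R's eigen() on the correlation matrix of the independent variables (cor_indep from the earlier sketch is assumed):

ev <- eigen(cor_indep)    # eigen decomposition of the correlation matrix
ev$values                 # eigenvalues, used for the Kaiser rule and the scree plot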
EIGEN VALUES
3.42697133  2.55089671  1.69097648  1.08655606  0.60942409  0.55188378
0.40151815  0.24695154  0.20355327  0.13284158  0.09842702
Scree Plot:
A scree plot is used to determine the number of factors to be used for PCA and FA. A scree plot shows the eigenvalues on the y-axis and the number of factors on the x-axis. The point where the slope of the curve clearly levels off (the elbow rule) indicates the number of factors that should be retained by the analysis.
According to the Kaiser-Guttman rule and the elbow rule, the number of points lying above an eigenvalue of 1 on the y-axis gives the number of factors to retain.
plot(<name of scree plot>, col = <color of dots>,xlab = <labels on x-axis>, ylab = <labels on y axis>)
lines(<name of scree plot>, col = <color of connecting line>)
abline (<height from where line is drawn>,col = <color of line>)
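A minimal sketch along the lines of the pseudo-syntax above (the eigenvalues from the earlier sketch are assumed; colours are illustrative):

plot(ev$values, col = "blue", xlab = "Factors", ylab = "Eigenvalues")   # scree plot points
lines(ev$values, col = "red")                                           # connecting line
abline(h = 1, col = "green")                                            # Kaiser-Guttman cut-off at eigenvalue = 1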
Scree Plot
Without rotation: principal(<datasetname>,nfactors = 4,rotate = "none")
With orthogonal rotation: principal(<datasetname>,nfactors = 4,rotate = "varimax")
In the R code, the unrotated and rotated models have been assigned to pcamodel1 and pcamodel2 respectively.
PCA results are typically interpreted in terms of the major loadings on each factor. These structures may be represented as a table of loadings, or graphically, showing only the loadings whose absolute value exceeds some cut-off point. Here the cut-off point is taken as 0.3, and the loadings filtered with this cut-off are shown in the table below.
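One way to apply this cut-off (assuming the rotated model pcamodel2 described above) is:

print(pcamodel2$loadings, cutoff = 0.3)   # hide loadings whose absolute value is below 0.3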
FA Rotated Loadings (cut-off = 0.3)

              RC1      RC2      RC3      RC4
ProdQual                                 0.647
Ecom                   0.787
TechSup                         0.883
CompRes       0.898
Advertising            0.530
ProdLine      0.525                      0.712
SalesFImage            0.971
ComPricing                              -0.590
WartyClaim                      0.885
OrdBilling    0.768
DelSpeed      0.949

                 RC1     RC2     RC3     RC4
SS loadings      2.635   1.967   1.641   1.371
Proportion Var   0.240   0.179   0.149   0.125
Cumulative Var   0.240   0.418   0.568   0.692

Factor 1 accounts for 24% of the variance; Factor 2 accounts for 17.9% of the variance; Factor 3 accounts for 14.9% of the variance; Factor 4 accounts for 12.5% of the variance. All the factors together explain 69.3% of the variance.
The obtained loadings are grouped in components with the help of a fa diagram. The fa diagram uses the
loadings and converts them into 4 factors.
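A minimal sketch (fa.diagram() from the psych package is assumed to be the function behind the diagrams shown below):

fa.diagram(pcamodel2)   # draws each variable under the factor on which it loads most strongly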
Principal Component analysis
Factor Analysis
After the PCA/FA, the PCA/FA scores are combined with the Satisfaction values from the parent dataset ("fact" in the given scenario) to create a data frame for regression modelling (assigned to pcaregdata and faregdata for PCA and FA respectively).
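A minimal sketch for the PCA case (the scores slot of the principal() output is assumed to be available because the model was fitted on the raw data; the FA case is analogous):

pcaregdata <- data.frame(pcamodel2$scores, Satisfaction = fact$Satisfaction)   # 4 component scores plus the target
head(pcaregdata)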
To perform the regression modelling, the factors obtained in PCA and FA are renamed as Purchase, Marketing, Afterservice and ProductQual.
Multiple Linear Regression (MLR) is an extension of simple linear regression. MLR models the relationship between the dependent variable and multiple independent continuous variables. In the given scenario, the dependent variable is Satisfaction and the independent variables are the 4 factors obtained from PCA/FA. The MLR model is represented by the equation:
y = β0 + β1x1 + β2x2 + ... + βkxk
where y is the dependent variable, x1, x2, ..., xk are the independent variables, β0 is the intercept and β1, β2, ..., βk are the variable coefficients.
In the report, the regression has been done both with and without splitting the data into train and test sets.
i) Regression modelling without data splitting
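A minimal sketch of the model fitted at this step (pcaregdata from the earlier sketch is assumed; the FA model is built the same way from faregdata):

pcareg <- lm(Satisfaction ~ ., data = pcaregdata)   # Satisfaction regressed on the four factors
summary(pcareg)                                     # coefficients, R2, adjusted R2, F-statistic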
Backtracking plot
A backtracking graph is plotted between the Satisfaction values predicted by the regression and those from the actual dataset. The strength of the regression model depends on how close the predicted values are to the actual values.
PCA Backtrack
FA Backtrack
The actual values are plotted in red and the predicted values are plotted in blue.
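A minimal sketch of how such a plot can be drawn for the PCA model (colours follow the description above; pcareg from the earlier sketch is assumed):

plot(fact$Satisfaction, type = "l", col = "red",
     xlab = "Observation", ylab = "Satisfaction")   # actual values
lines(predict(pcareg), col = "blue")                # predicted values from the regression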
b) Interpretations from the regression model:
• Multiple R2: The values of multiple R2 for the PCA and FA models are 66.07% and 69.74% respectively. This implies that 66.07% and 69.74% of the variation in Satisfaction is explained by the independent variables.
• Adjusted R2: The adjusted R2 compensates for the addition of variables and only increases if a new predictor actually enhances the model; it is a version of R2 adjusted for the number of predictors. The adjusted R2 values of the PCA and FA models (64.64% and 68.46% respectively) explain the variability in the models.
• Degrees of Freedom: The total number of observations in the dataset is 100, so the total df is 99 (100 − 1). The model estimates 5 coefficients (4 factors plus the intercept), so the model df is 4. The residual (error) df is therefore 99 − 4 = 95.
• F statistic and p value: The overall p-value of the model given by the F-statistic (2.2e-16) gives strong evidence against the null hypothesis; the model is statistically significant.
• Backtracking model: The lines in the backtracking plot almost overlap, which implies that the actual and predicted values are very close.
The dataset is split into train and test data, where train is 70% of the main data and test constitutes the
remaining 30%. The syntax used for splitting is:
set.seed(100)
sample(1:nrow(<datasetname>), 0.7*nrow(<datasetname>))
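A minimal sketch that completes the splitting syntax above, applied to the PCA regression data (the FA split is analogous):

set.seed(100)
train_idx <- sample(1:nrow(pcaregdata), 0.7 * nrow(pcaregdata))   # indices of the 70% training rows
pca_train <- pcaregdata[train_idx, ]
pca_test  <- pcaregdata[-train_idx, ]
pcareg_train <- lm(Satisfaction ~ ., data = pca_train)            # model fitted on the training data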
The regression tables for PCA (train and test) and FA (train and test) are given below:
Regression Equation (PCA model):
Satisfaction = 6.89394 + (0.56130 * Purchase) + (0.59052 * Marketing) + (0.09847 * Afterservice) + (0.47035 * ProductQual)

Regression Equation (FA model):
Satisfaction = 7.15569 + (0.93570 * Purchase) + (0.23133 * Marketing) + (-0.01100 * Afterservice) + (0.85112 * ProductQual)
Backtracking model
PCA Train Model
PCA Test Model
FA Train Model
FA Test Model
• Multiple R2: The values of multiple R2 for the PCA train and test data are 65.36% and 86.95% respectively; that is, 65.36% and 86.95% of the variation is explained by the independent variables. Similarly, 67.67% and 76.58% of the variation is explained for the FA train and test data respectively.
• Adjusted R2: The adjusted R2 values of the PCA train (63.23%) and test (84.86%) models and of the FA train (65.47%) and test (72.83%) models explain the variability in the respective models.
• Degrees of Freedom: The train datasets for PCA and FA contain 70 observations, so the total df is 69 (70 − 1). The model estimates 5 coefficients (4 factors plus the intercept), so the model df is 4 and the residual df is 69 − 4 = 65. Similarly, the residual df for the test datasets is 25 (29 − 4).
• F statistic and p value: The overall p-values of all 4 models, given by their F-statistics, are extremely small (of the order of 1e-16), which gives strong evidence against the null hypothesis; the models are statistically significant.
• Backtracking model: The lines in the backtracking plots almost overlap, which implies that the actual and predicted values are very close.
5 Conclusion
• The given dataset "Factor Hair" has 11 independent variables, namely ProdQual, Ecom, TechSup, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, WartyClaim, OrdBilling and DelSpeed. Satisfaction is the dependent variable.
• Exploratory data analysis was carried out to find the class, structure, dimensionality and summary of the data, and to plot the univariate and bivariate graphs (histogram, boxplot, density plot, scatterplot).
• The dataset was analysed for multicollinearity and, based on the tests, there was correlation between many of the independent variables. This implied that, in order to carry out multiple regression, the independent variables had to be reduced to factors.
• Simple linear regression was performed for the dependent variable against each independent variable, and the linear regression models and graphs were derived.
• The 11 independent variables were subjected to Principal Component Analysis and Factor Analysis to reduce their dimensionality. Based on the PCA/FA results, these variables were converted into 4 factors.
• These 4 factors were used to perform multiple linear regression against the dependent variable and to plot the backtracking graphs.
• Based on the above calculations, it can be concluded that the 4 factors explain the variation in the data by a good margin. Further, the predicted values closely overlap the actual values, which shows the validity of the regression model.