
Factor Hair Revised

Pratik Zanke
Contents
1 Project Objective
2 Assumptions
3 Exploratory Data Analysis – Step by step approach
   3.1 Environment Set up and Data Import
      3.1.1 Install necessary Packages and Invoke Libraries
      3.1.2 Set up working Directory
      3.1.3 Import and Read the Dataset
   3.2 Variable Identification
      3.2.1 Variable Identification – Inferences
   3.3 Exploratory Analysis
   3.4 Multicollinearity
   3.5 Simple Linear Regression (SLR)
   3.6 Principal Component Analysis and Factor Analysis
      3.6.1 Principal Component Analysis
      3.6.2 Factor Analysis
      3.6.3 PCA/FA Factor Grouping
4 Multiple Linear Regression (MLR)
   4.1.1 Regression Modeling
      i) Regression modelling without data splitting
         a) Checking the validity of the regression model
         b) Interpretations from the regression model
      ii) Regression modelling with train and test data
         a) Checking the validity of the regression model
         b) Interpretations from the regression model
5 Conclusion

1 Project Objective
The objective of this report is to explore the Factor Hair dataset in R and generate insights about the data. The analysis report will comprise the following:
• Importing the dataset into R
• Understanding the structure of the dataset
• Exploratory analysis of the dataset
• Principal component and factor analysis of the dataset
• Simple and multiple linear regression
• Insights from the analysis

2 Assumptions
The analysis and inferences in this case study are based on the data provided in the dataset "Factor hair". Below are the assumptions:

• Satisfaction is the dependent variable, and ProdQual, Ecom, TechSup, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, WartyClaim, OrdBilling and DelSpeed are the independent variables
• Data is normally distributed
• The relationship between the independent and dependent variables is linear
• The significance level is taken as 0.05

3 Exploratory Data Analysis – Step by step approach

A typical data analysis activity consists of the following steps:

1. Environment set up and data import
2. Variable identification
3. Univariate analysis
4. Bivariate analysis
5. Finding missing values and outliers
6. Checking for multicollinearity in the dataset
7. Simple linear regression of the dependent variable with each independent variable
8. Principal Component / Factor Analysis (PCA/FA) on the dataset
9. Multiple linear regression with the obtained PCA/FA factors

We will follow these steps in exploring the given dataset.

3.1 Environment Set up and Data Import

3.1.1 Install necessary Packages and Invoke Libraries


To perform descriptive statistics on the given data, we have installed and invoked the below
packages:

• readr : provides a fast and friendly way to read rectangular data (.csv, .tsv etc.) and to flexibly parse many types of data
• dplyr : a package that makes data manipulation easier
• psych : a package used for psychological, psychometric and personality research
• corrplot : used for creating and visualizing correlation matrices
• ggplot2 : used for creating elegant data visualizations, mostly graphs
• nFactors : provides indices and strategies to determine the number of factors/components to retain
• lattice : an implementation of Trellis graphics that provides a high-level data visualization system with an emphasis on multivariate data
• tidyr : tidies data to make it easy to visualize and model
• tidyverse : an opinionated collection of R packages used for data science
• car : a companion to applied regression, mainly used here for ANOVA tests and VIF calculations
The command used to install a package is:

install.packages("<package name>")
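As an illustrative sketch (the package list follows section 3.1.1; the install-only-if-missing logic is our own addition), the install-and-load step can be scripted in one go:

# Sketch: install any missing packages from the list above, then load them all.
pkgs <- c("readr", "dplyr", "psych", "corrplot", "ggplot2",
          "nFactors", "lattice", "tidyr", "tidyverse", "car")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))  # attach each package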

3.1.2 Set up working Directory


Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. Basically, the working directory is the location/folder on the PC where the data, code etc. related to the project are kept.

Please refer Appendix A for Source Code.


3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the file.
Please refer Appendix A for Source Code.

3.2 Variable Identification


The case study uses a combination of functions to cover the given activities, from exploratory analysis through to multiple linear regression:

• Environment setup: install.packages(), library(), setwd(), getwd(), read.csv


• Exploratory Analysis: dim(), any(is.na()), class(), names(), describe(), str(), attach(),
summary(), histogram(), densityplot(), multi.hist(), $out, cor(), corrplot(), plot()
• Multicollinearity: lm(), vif(), summary(), cor(), corrplot()
• Simple Linear Regression: lm(), loess.smooth()
• Principal Component Analysis / Factor Analysis: eigen(), principal(), plot(), lines(), fa.diagram(), print(), as.data.frame(), round(), set.seed(), lm(), predict(), cbind(), colnames()
• Multiple Linear Regression: lm(), plot(), lines(), data.frame()

3.2.1 Variable Identification – Inferences

• install.packages() : used to install all the packages mentioned in section 3.1.1
• library() : used to load the installed packages
• setwd() : sets the working directory from which the data files are imported. In the given case study, the command has been set as
Code: setwd("F:/project")

• getwd() : verifies the working directory which contains the data file to be imported

Note: Please refer to the code and output in the appendix.

3.3 Exploratory Analysis


• Importing dataset : The read.csv command has been used to read the data files for the given
questions.
Code: fact= read.csv("Factor hair.csv", header = TRUE)
Output : fact = import of file with 100 obs. of 13 variables

• Dimensions: Checking the dimensionality, that is, the number of rows and columns of the dataset. The code syntax is:
dim(fact)

The output of the code is: 100 and 13

This implies that the dataset has 100 rows and 13 columns.

• Class: The next step is to check the type of the entire data object. The syntax used here is:
class(fact)

The output is: data.frame

This implies that the dataset has different types of data combined in a data frame.

• Column Headings : Checking the names of the headers or the column headings. The syntax used here is:
names(fact)

The output is:

"ID" "ProdQual" "Ecom" "TechSup" "CompRes" "Advertising" "ProdLine" "SalesFImage" "ComPricing" "WartyClaim" "OrdBilling" "DelSpeed" "Satisfaction"

• Missing values: Checking for missing values in the dataset. Missing values refer to "NA" or blanks in the dataset.
The syntax for the code is:
any(is.na(fact))

The output is: FALSE

This implies that there are no missing values in the dataset.

• Structure of data set: The structure of the dataset refers to the type of each column of the dataset. There are 13 columns in the dataset. The syntax for the code is:
str(fact)

The output tells us that the ID column is of "integer" type and the remaining columns are of "num" type.

• Summary and describe: summary() gives us the quartile values of each variable in the dataset, and describe() gives us the mean, median, standard deviation, min-max ranges and kurtosis of the dataset. The syntax used here is:
summary(fact)
describe(fact)

• Univariate graphs : Univariate graphs show the distribution of a single variable. There are 3 univariate graphs for the dataset: histogram, density plot and box plot. In the given scenario, the graphs are plotted for the dependent variable, that is, Satisfaction. The syntax used here is:
histogram(<variablename>)
boxplot(<variablename>)
densityplot(<variablename>)
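A minimal sketch of these calls (assuming the data frame is named fact, as in section 3.3, and that the lattice package is loaded):

library(lattice)                                       # histogram() and densityplot()
histogram(fact$Satisfaction, xlab = "Satisfaction")    # univariate histogram
densityplot(fact$Satisfaction, xlab = "Satisfaction")  # density plot
boxplot(fact$Satisfaction, main = "Satisfaction")      # base-R boxplot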

The graph outputs are given below:

The histogram is inclined towards the right side, that is, it is right skewed.

Similarly, the density plot is inclined towards the right side, that is, it is right skewed.

The boxplot is skewed towards the top; when the plot is rotated clockwise, it is likewise skewed towards the right.

• Bivariate graphs : Bivariate graphs show the spread of all 11 independent variables of the dataset together. The graphs used here are the histogram, boxplot and scatterplot. The syntax used here is:
multi.hist(<datasetname>)
boxplot(<datasetname>)

The graph outputs are given below:

Histogram

Boxplot

Scatterplot

• Outliers : Outliers are values which do not fit in the standard range of the data. Based on the boxplots made, there are outliers in 4 independent variables. The syntax used here is:
boxplot(<dataset>$<independentvar>, plot = FALSE)$out

This is calculated for each independent variable and the output is:

Ecom    SalesFImage   OrdBilling   DelSpeed
5.6     7.8           6.7          1.6
5.7     7.8           6.5          1.6
5.1     8.2           2.0          1.6
5.1     7.8           2.0          1.6
5.1     7.8           6.7          1.6
5.5     8.2           6.5          1.6

(Each column lists the outlying values of that variable independently; the rows do not correspond to common observations.)
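A hedged sketch of how this per-variable extraction might be looped (the variable names are those flagged by the boxplots; fact is the imported data frame):

# Collect boxplot outliers for the four flagged variables in one pass.
vars <- c("Ecom", "SalesFImage", "OrdBilling", "DelSpeed")
outliers <- lapply(fact[vars], function(x) boxplot(x, plot = FALSE)$out)
outliers   # named list: outlying values per variable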

3.4 Multicollinearity
Multicollinearity refers to correlation among the independent variables of the dataset. In a regression model, multicollinearity increases the standard errors of the coefficients. This can make some independent variables appear statistically insignificant, leading to a distorted regression model.

There are 2 ways to detect multicollinearity in a dataset.


a. Correlation Matrix
b. Variance Inflation Factor (VIF)

In the given dataset, we have 1 dependent variable (Satisfaction) and the remaining 11 are independent variables (excluding the ID).
Correlation Matrix: For finding the correlation coefficients and matrix, the below commands are used:
correlation = cor(fact1)
corrplot(<correlationmatrix>, method = "number")
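Put together, a minimal sketch of this step (assuming fact1 is fact with the ID column dropped, as used throughout the report):

library(corrplot)
fact1 <- fact[ , -1]                      # drop the ID column
correlation <- cor(fact1)                 # correlation matrix
corrplot(correlation, method = "number")  # plot coefficients as numbers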

The cor command is used to determine the correlation matrix and the corrplot is used to plot the
correlation coefficients. Below is the corrplot:

The above plot gives us the correlation between the variables in the given dataset and the conclusions are
as follows:
1. CompRes and DelSpeed are highly correlated (0.87)
2. OrdBilling and CompRes are highly correlated (0.76)
3. WartyClaim and TechSup are highly correlated (0.80)
4. OrdBilling and DelSpeed are highly correlated (0.75)
5. Ecom and SalesFImage are highly correlated (0.79)

Variance Inflation Factor: The second method is the variance inflation factor (VIF). VIF is a measure of the amount of multicollinearity in multiple regression variables. It tells how much the behavior (variance) of an independent variable's coefficient is influenced, or inflated, by its correlation with the other independent variables.

The command for calculating vif:

model1 = lm(Satisfaction~.,data = fact1)


vif(model1)

where lm is used to calculate the linear regression between the dependent variable and independent
variables.

The output is as below:

ProdQual      Ecom          TechSup       CompRes       Advertising   ProdLine
1.635797      2.756694      2.976796      4.730448      1.508933      3.488185

SalesFImage   ComPricing    WartyClaim    OrdBilling    DelSpeed
3.439420      1.635000      3.198337      2.902999      6.516014

The VIF values are greater than 1 in all cases, which indicates that multicollinearity exists in the dataset. In particular, the VIF for DelSpeed is 6.51 and for CompRes is 4.73, which shows very high multicollinearity.
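As a check on what VIF measures, the value reported by car::vif() for a variable equals 1 / (1 - R^2) from an auxiliary regression of that variable on the other predictors. A sketch (DelSpeed chosen for illustration; fact1 as above):

library(car)
model1 <- lm(Satisfaction ~ ., data = fact1)
vif(model1)["DelSpeed"]                               # VIF as reported by car
aux <- lm(DelSpeed ~ . - Satisfaction, data = fact1)  # auxiliary regression
1 / (1 - summary(aux)$r.squared)                      # should match vif() (~6.52)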

3.5 Simple Linear Regression (SLR)

Simple linear regression is used to determine the relationship between a dependent and an independent variable in a dataset. In the "Factor hair" dataset, there are 11 independent variables, with Satisfaction as the dependent variable. SLR is represented by the equation:

y = β0 + β1x1

where y is the dependent variable, x1 is the independent variable, and β0 and β1 are the intercept and slope coefficients respectively.

The R syntaxes for the regression analysis are:

SLR: lm(<dependent var> ~ <independent var>)
Scatterplot: plot(<independent var>, <dependent var>)
Regression line: lines(loess.smooth(<independent var>, <dependent var>), col = "<regression line color>")

The loess.smooth function computes a smooth curve that is added to the scatterplot, and the col argument sets the color of the regression line.

The linear regression is done using the lm() function against the imported dataset “fact”.

The linear regression has been performed on all the independent variables and the regression graphs have
been plotted (Code in the appendix). The linear model and the graph output are given below for all the
11 variables:
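Before the tables, a minimal sketch of one such fit (ProdQual chosen for illustration, assuming the fact data frame):

slr1 <- lm(Satisfaction ~ ProdQual, data = fact)   # SLR for the first predictor
summary(slr1)                                      # coefficients, R2, F-statistic
plot(fact$ProdQual, fact$Satisfaction,
     xlab = "ProdQual", ylab = "Satisfaction")     # scatterplot
lines(loess.smooth(fact$ProdQual, fact$Satisfaction), col = "red")  # smooth line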

Linear model of Satisfaction and ProdQual

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   3.67593    0.59765     6.151     1.68E-08
ProdQual    0.41512    0.07534     5.510     2.90E-07

R2 = 0.2365   Adjusted R2 = 0.2287   F-stat = 30.36   DF = 1 and 98   p-value = 2.90E-07
Residual standard error = 1.047 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 3.6759 + 0.4151 * ProdQual
Intercept (β0) = 3.6759
Slope (β1) = 0.4151

Interpretation: For any one unit change in product quality, the Satisfaction rating would improve by 0.4151, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and Ecom

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   5.1516     0.6161      8.361     4.28E-13
Ecom        0.4811     0.1649      2.918     4.37E-03

R2 = 0.07994   Adjusted R2 = 0.07056   F-stat = 8.515   DF = 1 and 98   p-value = 0.004368
Residual standard error = 1.149 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 5.1516 + 0.4811 * Ecom
Intercept (β0) = 5.1516
Slope (β1) = 0.4811

Interpretation: For any one unit change in Ecom, the Satisfaction rating would improve by 0.4811, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and TechSup

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   6.44757    0.43592     14.791    < 2e-16
TechSup     0.08768    0.07817     1.122     2.65E-01

R2 = 0.01268   Adjusted R2 = 0.002603   F-stat = 1.258   DF = 1 and 98   p-value = 0.2647
Residual standard error = 1.19 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 6.4475 + 0.08768 * TechSup
Intercept (β0) = 6.4475
Slope (β1) = 0.08768

Interpretation: For any one unit change in TechSup, the Satisfaction rating would improve by 0.08768, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and CompRes

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   3.68005    0.44285     8.310     5.51E-13
CompRes     0.59499    0.07946     7.488     3.09E-11

R2 = 0.3639   Adjusted R2 = 0.3574   F-stat = 56.07   DF = 1 and 98   p-value = 3.08E-11
Residual standard error = 0.9554 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 3.68005 + 0.59499 * CompRes
Intercept (β0) = 3.68005
Slope (β1) = 0.59499

Interpretation: For any one unit change in CompRes, the Satisfaction rating would improve by 0.59499, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and Advertising

             Estimate   Std.Error   t value   Pr(>|t|)
Intercept    5.6259     0.4237      13.279    < 2e-16
Advertising  0.3222     0.1018      3.167     2.06E-03

R2 = 0.09282   Adjusted R2 = 0.08357   F-stat = 10.03   DF = 1 and 98   p-value = 0.002056
Residual standard error = 1.141 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 5.6259 + 0.3222 * Advertising
Intercept (β0) = 5.6259
Slope (β1) = 0.3222

Interpretation: For any one unit change in Advertising, the Satisfaction rating would improve by 0.3222, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and ProdLine

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   4.02203    0.45471     8.845     3.87E-14
ProdLine    0.49887    0.07641     6.529     2.95E-09

R2 = 0.3031   Adjusted R2 = 0.296   F-stat = 42.62   DF = 1 and 98   p-value = 2.95E-09
Residual standard error = 1 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 4.02203 + 0.49887 * ProdLine
Intercept (β0) = 4.02203
Slope (β1) = 0.49887

Interpretation: For any one unit change in ProdLine, the Satisfaction rating would improve by 0.49887, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and SalesFImage

             Estimate   Std.Error   t value   Pr(>|t|)
Intercept    4.06983    0.50874     8.000     2.54E-12
SalesFImage  0.55596    0.09722     5.719     1.16E-07

R2 = 0.2502   Adjusted R2 = 0.2426   F-stat = 32.7   DF = 1 and 98   p-value = 1.16E-07
Residual standard error = 1.037 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 4.06983 + 0.55596 * SalesFImage
Intercept (β0) = 4.06983
Slope (β1) = 0.55596

Interpretation: For any one unit change in SalesFImage, the Satisfaction rating would improve by 0.55596, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and ComPricing

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   8.03856    0.54427     14.769    < 2e-16
ComPricing  -0.16068   0.07621     -2.108    0.0376

R2 = 0.04339   Adjusted R2 = 0.03363   F-stat = 4.445   DF = 1 and 98   p-value = 0.03756
Residual standard error = 1.172 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 8.03856 - 0.16068 * ComPricing
Intercept (β0) = 8.03856
Slope (β1) = -0.16068

Interpretation: For any one unit change in ComPricing, the Satisfaction rating would decrease by 0.16068 and vice versa, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and WartyClaim

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   5.3581     0.8813      6.079     2.32E-08
WartyClaim  0.2581     0.1445      1.786     7.72E-02

R2 = 0.03152   Adjusted R2 = 0.02164   F-stat = 3.19   DF = 1 and 98   p-value = 7.72E-02
Residual standard error = 1.179 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 5.3581 + 0.2581 * WartyClaim
Intercept (β0) = 5.3581
Slope (β1) = 0.2581

Interpretation: For any one unit change in WartyClaim, the Satisfaction rating would improve by 0.2581, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and OrdBilling

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   4.0541     0.484       8.377     3.96E-13
OrdBilling  0.6695     0.1106      6.054     2.60E-08

R2 = 0.2722   Adjusted R2 = 0.2648   F-stat = 36.65   DF = 1 and 98   p-value = 2.60E-08
Residual standard error = 1.022 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 4.0541 + 0.6695 * OrdBilling
Intercept (β0) = 4.0541
Slope (β1) = 0.6695

Interpretation: For any one unit change in OrdBilling, the Satisfaction rating would improve by 0.6695, keeping other things constant, as explained by the model.

Linear Model of Satisfaction and DelSpeed

            Estimate   Std.Error   t value   Pr(>|t|)
Intercept   3.2791     0.5294      6.194     1.38E-08
DelSpeed    0.9364     0.1339      6.994     3.30E-10

R2 = 0.333   Adjusted R2 = 0.3262   F-stat = 48.92   DF = 1 and 98   p-value = 3.30E-10
Residual standard error = 0.9783 on 98 DF

As per the table, the linear model equation is:
Satisfaction = 3.2791 + 0.9364 * DelSpeed
Intercept (β0) = 3.2791
Slope (β1) = 0.9364

Interpretation: For any one unit change in DelSpeed, the Satisfaction rating would improve by 0.9364, keeping other things constant, as explained by the model.

3.6 Principal Component Analysis and Factor Analysis

The next part of the analysis is to perform Principal Component Analysis and Factor Analysis on the given dataset.

Principal Component Analysis (PCA) is a method of dimensionality reduction: the correlated independent variables in the dataset are grouped or clustered into common factors. This is done by transforming the variables into a new set of components, known as the principal components.

Factor Analysis (FA), on the other hand, is quite like Principal Component Analysis. It also involves shrinking the dataset into a smaller, more manageable set of factors.

The main difference between PCA and FA is that the components in PCA are linear combinations of the original variables, whereas in FA the original variables are defined as combinations of the underlying factors. Principal Component Analysis reduces the data to a smaller number of components, while factor analysis describes the constructs that underlie the data.

To check whether the dataset is suitable for PCA/FA, we use the below tests:
a) Bartlett's test of Sphericity
b) Kaiser-Meyer-Olkin test (KMO)
Bartlett's test of Sphericity: In this test, the correlation matrix is compared against an identity matrix; an identity matrix would indicate that the variables are unrelated and therefore not suitable for structure detection. A p-value less than the significance level, alpha (0.05), indicates that a factor analysis will be useful for the data.

The test has been conducted on the “fact” dataset.


The syntax used is:
cortest.bartlett()

The output is:
Chi-Square p value DF
619.2726 1.793E-96 55

The p-value of the test is 1.79e-96, which is less than the level of significance (alpha = 0.05). This indicates that factor analysis is suitable for the dataset.
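A sketch of the call (assuming the test is run on the correlation matrix of the 11 independent variables, which matches the 55 degrees of freedom reported above):

library(psych)
iv <- fact[ , !(names(fact) %in% c("ID", "Satisfaction"))]  # 11 predictors
cortest.bartlett(cor(iv), n = nrow(iv))                     # chi-square, p-value, df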

Kaiser-Meyer Olkin test (KMO): The Kaiser-Meyer-Olkin Measure of Sampling Adequacy is a statistic that
indicates the proportion of variance in your variables that might be caused by underlying factors. If the
value is less than 0.50, the results of the factor analysis probably won't be very useful.

The test has been conducted on the “fact” dataset. The syntax used is:
KMO (<correlation matrix of the dataset>)

The output of the KMO test is:

Overall MSA = 0.65

ProdQual   Ecom   TechSup   CompRes   Advertising
0.51       0.63   0.52      0.79      0.78

ProdLine   SalesFImage   ComPricing   WartyClaim   OrdBilling   DelSpeed
0.62       0.62          0.75         0.51         0.76         0.67

The overall MSA of the KMO test is 0.65, which is greater than 0.5. Hence, PCA/FA is considered appropriate.
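A sketch of the corresponding call (iv as in the Bartlett sketch above):

library(psych)
iv <- fact[ , !(names(fact) %in% c("ID", "Satisfaction"))]
KMO(cor(iv))   # per-variable MSA values plus the overall MSA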

Eigen Values and Vectors:

The first step towards dimensionality reduction is the calculation of eigenvectors and eigenvalues, which are the core of PCA and FA. The eigenvectors determine the directions of the new space and the eigenvalues determine the magnitudes of the vectors. Eigenvectors and eigenvalues are calculated on the correlation matrix.

The syntax is as follows:

Eigen vectors: eigen(cor(<name of dataset>))
Eigen values: <eigenvectorname>$values

The output of the eigen values are:

EIGEN VALUES
3.42697133 2.55089671 1.69097648 1.08655606
0.60942409 0.55188378 0.40151815 0.24695154
0.20355327 0.13284158 0.09842702

Scree Plot:
A scree plot is used to determine the number of factors to be used for PCA and FA. A scree plot shows the eigenvalues on the y-axis and the number of factors on the x-axis. The point where the slope of the curve clearly levels off (the elbow rule) indicates the number of factors that should be generated by the analysis.
According to the Kaiser-Guttman rule and the elbow rule, the number of points above an eigenvalue of 1 on the y-axis gives the number of factors.

The syntax for the scree plot:

plot(<eigenvalues>, col = <color of dots>, xlab = <label on x-axis>, ylab = <label on y-axis>)
lines(<eigenvalues>, col = <color of connecting line>)
abline(h = <height at which line is drawn>, col = <color of line>)
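A concrete sketch of the eigenvalue computation and scree plot (names are ours; iv holds the 11 independent variables):

iv <- fact[ , !(names(fact) %in% c("ID", "Satisfaction"))]
ev <- eigen(cor(iv))$values              # eigenvalues of the correlation matrix
plot(ev, type = "b", col = "blue",
     xlab = "Number of factors", ylab = "Eigenvalue", main = "Scree Plot")
abline(h = 1, col = "red")               # Kaiser-Guttman cut-off at eigenvalue 1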

The output graph is:

Scree Plot

Based on the above graph, the number of factors to be considered are 4.


Therefore, the 11 independent variables will be reduced to 4 factors.

3.6.1 Principal Component Analysis:


The syntax used for Principal Component Analysis is:

Without rotation: principal(<datasetname>, nfactors = 4, rotate = "none")
With orthogonal rotation: principal(<datasetname>, nfactors = 4, rotate = "varimax")

In the R code, the unrotated and rotated models have been assigned as pcamodel1 and pcamodel2 respectively.

PCA results are typically interpreted in terms of the major loadings on each factor. These structures may be represented as a table of loadings or graphically, where only loadings with an absolute value above some cut-off point are shown. Here the cut-off point is taken as 0.3, and the results are below.
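A sketch of the two calls with the 0.3 cut-off applied at print time (model names follow the text; iv as before):

library(psych)
iv <- fact[ , !(names(fact) %in% c("ID", "Satisfaction"))]
pcamodel1 <- principal(iv, nfactors = 4, rotate = "none")     # unrotated
pcamodel2 <- principal(iv, nfactors = 4, rotate = "varimax")  # rotated
print(pcamodel2$loadings, cutoff = 0.3)   # hide loadings below |0.3|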

PCA Rotated Loadings

                RC1     RC2     RC3     RC4
ProdQual                                0.876
Ecom                    0.871
TechSup                         0.939
CompRes         0.926
Advertising             0.742
ProdLine        0.591                   0.642
SalesFImage             0.900
ComPricing                             -0.723
WartyClaim                      0.931
OrdBilling      0.864
DelSpeed        0.938

                RC1     RC2     RC3     RC4
SS loadings     2.893   2.234   1.856   1.774
Proportion Var  0.263   0.203   0.169   0.161
Cumulative Var  0.263   0.466   0.635   0.796

Factor 1 accounts for 26.3% of the variance, Factor 2 for 20.3%, Factor 3 for 16.9% and Factor 4 for 16.1%. All four factors together explain 79.6% of the variance.

3.6.2 Factor Analysis:

The syntax used for Factor analysis (with rotation) is:

fa(r=<datasetname>,nfactors = 4,rotate = "varimax",fm = "pa")

The obtained loadings are filtered with a cut off of 0.3 and the below is the obtained table.
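A sketch of the call (principal-axis factoring with varimax rotation; famodel is our placeholder name, iv as before):

library(psych)
iv <- fact[ , !(names(fact) %in% c("ID", "Satisfaction"))]
famodel <- fa(r = iv, nfactors = 4, rotate = "varimax", fm = "pa")
print(famodel$loadings, cutoff = 0.3)   # apply the 0.3 cut-off when printing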

FA Rotated Loadings

                PA1     PA2     PA3     PA4
ProdQual                                0.647
Ecom                    0.787
TechSup                         0.883
CompRes         0.898
Advertising             0.530
ProdLine        0.525                   0.712
SalesFImage             0.971
ComPricing                             -0.590
WartyClaim                      0.885
OrdBilling      0.768
DelSpeed        0.949

                PA1     PA2     PA3     PA4
SS loadings     2.635   1.967   1.641   1.371
Proportion Var  0.240   0.179   0.149   0.125
Cumulative Var  0.240   0.418   0.568   0.692

Factor 1 accounts for 24.0% of the variance, Factor 2 for 17.9%, Factor 3 for 14.9% and Factor 4 for 12.5%. All four factors together explain 69.2% of the variance.

3.6.3 PCA/FA Factor Grouping

The obtained loadings are grouped into components with the help of an fa.diagram. The diagram uses the loadings and converts them into 4 factors.

The syntax for the fa diagram is:

fa.diagram(<modelname>, cex = <text size>, e.size = <circle size>, rsize = <box size>)

The output figure for PCA and FA are:

Principal Component analysis

Factor Analysis

After the PCA/FA, the PCA/FA scores are combined with the "Satisfaction" scores from the parent dataset ("fact" in the given scenario) to create a data frame for regression modelling (assigned as pcaregdata and faregdata for PCA and FA respectively).

To perform the regression modelling, the factors obtained in PCA and FA are renamed according to the below table (a sketch of the assembly follows the table):

Factors    Short interpretation                              Variables               PCA name      FA name

RC1/PA1    Items related to purchasing the product, from     DelSpeed, CompRes,      Purchase      Sales
           placing the order to billing and delivery         OrdBilling
RC2/PA2    Items related to marketing processes, such as     SalesFImage, Ecom,      Marketing     Brandim
           the salesforce image and advertising expenses     Advertising
RC3/PA3    Items covering post-purchase activities, such     WartyClaim, TechSup     Afterservice  Aftersaleexp
           as warranty claims and technical support
RC4/PA4    Items related to product positioning              ProdLine, ProdQual,     Productqual   Prodplace
                                                             ComPricing
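A sketch of how the score data frames might be assembled and renamed (this assumes the factor-score columns come out in the RC1-RC4 / PA1-PA4 order shown above; otherwise the renaming must follow the actual column labels):

# Combine factor scores with Satisfaction and rename per the table above.
pcaregdata <- data.frame(pcamodel2$scores, Satisfaction = fact$Satisfaction)
colnames(pcaregdata) <- c("Purchase", "Marketing", "Afterservice",
                          "Productqual", "Satisfaction")
faregdata <- data.frame(famodel$scores, CustSat = fact$Satisfaction)
colnames(faregdata) <- c("Sales", "Brandim", "Aftersaleexp",
                         "Prodplace", "CustSat")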

4 Multiple Linear Regression (MLR):

Multiple Linear Regression is an extension of simple linear regression. MLR describes the relationship between the dependent variable and multiple continuous independent variables. In the given scenario, the dependent variable is "Satisfaction" and the independent variables are the 4 factors obtained from PCA/FA.

The MLR is represented by the below equation:

y = β0 + β1x1 + β2x2 + ... + βkxk

where y is the dependent variable, x1, x2, ..., xk are the independent variables, β0 is the intercept and β1, β2, ..., βk are the variable coefficients.

4.1.1 Regression Modeling:

Regression modeling can be done in two ways:

a) Modeling for the entire data


b) Splitting the data into train and test data for modeling

In this report, the regression has been done both with and without splitting.

i) Regression modelling without data splitting

The syntax used for regression modeling is:

Principal Component Analysis:


lm(Satisfaction~Purchase+Marketing+Afterservice+Productqual,data=<dataset>)
Factor Analysis:
lm(CustSat~Sales+Brandim+Aftersaleexp+Prodplace,data=<dataset>)

The output tables for the regression are:

Principal Component Analysis (without data splitting)

              Estimate   Std.Error   t value   Pr(>|t|)
Intercept     6.91813    0.07087     97.617    < 2e-16
Purchase      0.61799    0.07122     8.677     1.11E-13
Marketing     0.50994    0.07123     7.159     1.71E-10
Afterservice  0.06686    0.07120     0.939     0.35
Productqual   0.54014    0.07124     7.582     2.27E-11

Multiple R2 = 0.6607   Adjusted R2 = 0.6464   F-stat = 46.25   p-value < 2.2e-16   DF = 4 and 95
Residual standard error: 0.7087 on 95 DF

Regression Equation:
Satisfaction = 6.91813 + 0.61799 * Purchase + 0.50994 * Marketing + 0.06686 * Afterservice + 0.54014 * Productqual

Factor Analysis (without data splitting)

              Estimate   Std.Error   t value   Pr(>|t|)
Intercept     6.91778    0.06693     103.358   < 2e-16
Sales         0.57956    0.06856     8.453     3.32E-13
Brandim       0.61964    0.06831     9.071     1.60E-14
Aftersaleexp  0.05774    0.07171     0.805     0.423
Prodplace     0.61187    0.07650     7.998     3.03E-12

Multiple R2 = 0.6974   Adjusted R2 = 0.6846   F-stat = 54.73   p-value < 2.2e-16   DF = 4 and 95
Residual standard error: 0.6693 on 95 DF

Regression Equation:
Satisfaction = 6.91778 + 0.57956 * Sales + 0.61964 * Brandim + 0.05774 * Aftersaleexp + 0.61187 * Prodplace

a) Checking the validity of the regression model:

Backtracking plot

A backtracking graph is plotted between the Satisfaction values predicted by the regression model and those from the actual dataset. The strength of the regression modelling depends on how close the predicted values are to the actual values.
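A sketch of how such a plot might be drawn for the PCA model (pcareg and pcaregdata are the names assumed in the sketches above):

pcareg <- lm(Satisfaction ~ Purchase + Marketing + Afterservice + Productqual,
             data = pcaregdata)
pred <- predict(pcareg)                               # fitted values
plot(pcaregdata$Satisfaction, type = "l", col = "red",
     xlab = "Observation", ylab = "Satisfaction")     # actual values in red
lines(pred, col = "blue")                             # predicted values in blue
legend("topleft", legend = c("Actual", "Predicted"),
       col = c("red", "blue"), lty = 1)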

PCA Backtrack

FA Backtrack

The actual values are plotted with red lines and the predicted values with blue lines.

b) Interpretations from the regression model:

• Multiple R2: The values of multiple R2 for the PCA and FA models are 66.07% and 69.74% respectively. This implies that 66.07% and 69.74% of the variation is explained by the independent variables.
• Adjusted R2: The adjusted R2 compensates for the addition of variables and only increases if a new predictor enhances the model; it is a modified version of R2 adjusted for the number of predictors. The adjusted R2 values in the PCA and FA models (64.64% and 68.46% respectively) explain the variability in the models.
• Degrees of Freedom: The total number of observations in the dataset is 100. Each model estimates 5 coefficients (the intercept plus the 4 factors), so the residual degrees of freedom are 100 - 5 = 95.
• F statistic and p value: The overall p-value of the model (< 2.2e-16), given by the F-statistic, provides strong evidence against the null hypothesis. The model is significantly valid at this point.
• Backtracking model: As we can see, the lines in the backtracking plots almost overlap, which implies that the actual and predicted values are nearly the same.

ii) Regression modelling with train and test data:

The dataset is split into train and test data, where train is 70% of the main data and test constitutes the
remaining 30%. The syntax used for splitting is:
set.seed(100)
sample(1:nrow(<datasetname>), 0.7*nrow(<datasetname>))

The syntax used for regression modelling is:

PCA: lm(Satisfaction ~ Purchase + Marketing + Afterservice + Productqual, data = <dataset>)
FA: lm(CustSat ~ Sales + Brandim + Aftersaleexp + Prodplace, data = <dataset>)
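A sketch of the full split-and-fit sequence for the PCA data (the report fits separate models on the train and test subsets; pcaregdata as assembled above):

set.seed(100)
idx <- sample(1:nrow(pcaregdata), 0.7 * nrow(pcaregdata))  # 70% train indices
train <- pcaregdata[idx, ]
test  <- pcaregdata[-idx, ]
trainmodel <- lm(Satisfaction ~ Purchase + Marketing + Afterservice + Productqual,
                 data = train)
testmodel  <- lm(Satisfaction ~ Purchase + Marketing + Afterservice + Productqual,
                 data = test)
summary(trainmodel)$r.squared   # multiple R2 on the train subset
summary(testmodel)$r.squared    # multiple R2 on the test subset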

The regression tables for PCA (train and test) and FA (train and test) are given below:

PCA Train Data

              Estimate   Std.Error   t value   Pr(>|t|)
Intercept     6.89394    0.09064     76.055    < 2e-16
Purchase      0.56130    0.08285     6.775     4.33E-09
Marketing     0.59052    0.09065     6.514     1.24E-08
Afterservice  0.09847    0.08683     1.134     0.261
Productqual   0.47035    0.08590     5.475     7.55E-07

Multiple R2 = 0.6536   Adjusted R2 = 0.6323   F-stat = 46.25   p-value < 2.42e-16   DF = 4 and 65
Residual standard error: 0.7455 on 65 DF

Regression Equation:
Satisfaction = 6.89394 + (0.56130 * Purchase) + (0.59052 * Marketing) + (0.09847 * Afterservice) + (0.47035 * Productqual)

PCA Test Data

              Estimate   Std.Error   t value   Pr(>|t|)
Intercept     7.15569    0.08762     81.669    < 2e-16
Purchase      0.93570    0.10314     9.072     2.20E-09
Marketing     0.23133    0.08354     2.769     0.0104
Afterservice  -0.01100   0.08849     -0.124    0.9021
Productqual   0.85112    0.09574     8.890     3.26E-09

Multiple R2 = 0.8695   Adjusted R2 = 0.8486   F-stat = 41.64   p-value = 1.05E-10   DF = 4 and 25
Residual standard error: 0.4295 on 25 DF

Regression Equation:
Satisfaction = 7.15569 + (0.93570 * Purchase) + (0.23133 * Marketing) + (-0.01100 * Afterservice) + (0.85112 * Productqual)

FA Train Data

              Estimate   Std.Error   t value   Pr(>|t|)
Intercept     6.88507    0.07720     89.183    < 2e-16
Sales         0.60031    0.07416     8.095     1.98E-11
Brandim       0.58485    0.08508     6.875     2.89E-09
Aftersaleexp  -0.01849   0.07925     -0.233    0.816
Prodplace     0.55627    0.09408     5.913     1.37E-07

Multiple R2 = 0.6747   Adjusted R2 = 0.6547   F-stat = 33.71   p-value = 3.22E-15   DF = 4 and 65
Residual standard error: 0.6453 on 65 DF

Regression Equation:
Satisfaction = 6.88507 + (0.60031 * Sales) + (0.58485 * Brandim) + (-0.01849 * Aftersaleexp) + (0.55627 * Prodplace)

FA Test Data

              Estimate   Std.Error   t value   Pr(>|t|)
Intercept     7.0102     0.1343      52.203    < 2e-16
Sales         0.5169     0.1757      2.942     0.006938
Brandim       0.6485     0.1434      4.523     0.000128
Aftersaleexp  0.3002     0.1722      1.743     0.093609
Prodplace     0.6142     0.1762      3.487     0.001826

Multiple R2 = 0.7658   Adjusted R2 = 0.7283   F-stat = 20.43   p-value = 1.40E-07   DF = 4 and 25
Residual standard error: 0.7291 on 25 DF

Regression Equation:
Satisfaction = 7.0102 + (0.5169 * Sales) + (0.6485 * Brandim) + (0.3002 * Aftersaleexp) + (0.6142 * Prodplace)

a) Checking the validity of the regression model:

Backtracking model

PCA Train Model

PCA Test Model

FA Train Model

FA Test Model

b) Interpretations from the regression model:

• Multiple R2: The values of multiple R2 for the PCA train and test data are 65.36% and 86.95% respectively. This implies that 65.36% and 86.95% of the variation is explained by the independent variables. Similarly, 67.47% and 76.58% of the variation is explained for the FA train and test data respectively.
• Adjusted R2: The adjusted R2 values in the PCA train (63.23%) and test (84.86%) models explain the variability of the PCA models, and the FA train (65.47%) and test (72.83%) values explain the variability in the FA models.
• Degrees of Freedom: The train datasets for PCA and FA contain 70 observations each, and each model estimates 5 coefficients (the intercept plus the 4 factors), so the residual degrees of freedom are 70 - 5 = 65. Similarly, the residual degrees of freedom in the test datasets are 30 - 5 = 25.
• F statistic and p value: The overall p-values of all 4 models, given by the F-statistic, are extremely small, which provides strong evidence against the null hypothesis. The models are significantly valid at this point.
• Backtracking model: As we can see, the lines in the backtracking plots almost overlap, which implies that the actual and predicted values are nearly the same.

5 Conclusion

• The given dataset "Factor Hair" has 11 independent variables, namely ProdQual, Ecom, TechSup, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, WartyClaim, OrdBilling and DelSpeed. Satisfaction is the dependent variable.
• Exploratory data analysis was carried out to find the class, structure, dimensionality and summary of the data and to plot the univariate and bivariate graphs (histogram, boxplot, density plot, scatterplot).
• The dataset was examined for multicollinearity and, based on the tests, there was correlation between many of the independent variables. This implied that, to carry out multiple regression, the independent variables had to be reduced to factors.
• Simple linear regression was performed for the dependent variable against each independent variable, and the linear regression models and graphs were derived.
• The 11 independent variables were subjected to Principal Component Analysis and Factor Analysis to reduce their dimensionality. Based on the PCA/FA results, these variables were converted into 4 factors.
• These 4 factors were used to perform multiple linear regression against the dependent variable and to plot the backtracking graphs.
• Based on the above calculations, it is concluded that the 4 independent factors explain the regression in the data by a good margin. Further, the predicted values overlap the actual values, which shows the validity of the regression model.
