Project Objective
The project is based on multivariate analysis and its application to market segmentation in the context of product and service management. The objective is to check the given dataset and its variables for multicollinearity, perform factor analysis, and then build a multiple regression model to predict Customer Satisfaction based on the four factors chosen during factor analysis.
What is multicollinearity, and how does it affect a regression model? Multicollinearity occurs when the independent variables of a regression model are correlated with one another. If the degree of collinearity between the independent variables is high, it becomes difficult to estimate the relationship between each independent variable and the dependent variable, and the overall precision of the estimated coefficients suffers. A regression model with high multicollinearity can give a high R-squared yet hardly any significant variables.
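A short simulated sketch of this effect (the variables below are illustrative stand-ins, not from the project's dataset): two nearly identical predictors leave the model's R-squared high while making the individual coefficients hard to pin down.

```r
# Simulated demonstration of multicollinearity: x2 is almost a copy of x1.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)          # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)             # y really depends only on x1

fit <- lm(y ~ x1 + x2)
summary(fit)$r.squared                  # overall fit stays high
summary(fit)$coefficients[, "t value"]  # x1 and x2 individually look weak
```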
Defining Business Problem
Customer Satisfaction has been a topic of great interest to organizations across different industries. One of the most important objectives of any organization is to increase profits and reduce costs. The most important factor in achieving both is Customer Satisfaction, because it leads to customer loyalty, repeat purchases, and recommendations. Moreover, Customer Satisfaction and Service Quality have proven to be effective tools for improving the overall performance of an organization.
So, what is satisfaction? A customer who is content with a product or service is said to be satisfied. Satisfaction can also be described as a person's feeling of pleasure or disappointment resulting from comparing a product's or service's perceived performance with their expectations.
Assumptions
Data Dictionary
The target, or dependent, variable in the given dataset is "Customer Satisfaction", with values between 1 and 10, where 1 is the lowest level of Customer Satisfaction and 10 the highest. The dataset has 100 records with 11 different independent variables. The variables are listed below:
Data Summary and Exploratory Data Analysis
ProdQual – We check the summary of this variable; after reviewing the results, we can conclude that there are no outliers in the data. This is further substantiated by the box plot.
> summary(mydata$ProdQual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.000 6.575 8.000 7.810 9.100 10.000
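The box-plot check used throughout this section follows the standard 1.5×IQR rule. A minimal sketch of that rule (shown on a stand-in vector in ProdQual's range so the snippet runs on its own) could look like this:

```r
# Flag values outside the box-plot fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
check_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x[x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr]
}

x <- c(5.0, 6.6, 8.0, 9.1, 10.0, 7.8)  # stand-in values in ProdQual's range
check_outliers(x)           # numeric(0): nothing flagged
check_outliers(c(x, 25))    # 25 falls outside the upper fence
```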
Ecom – We follow the same steps for this variable, checking the summary and looking for outliers. The results show that this variable has a few outliers, but not enough to warrant outlier treatment.
> summary(mydata$Ecom)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 3.275 3.600 3.672 3.925 5.700
TechSup – We review the summary and outlier results for this variable and find that it has no outliers.
> summary(mydata$TechSup)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.300 4.250 5.400 5.365 6.625 8.500
CompRes – The summary results for this variable also do not show any outliers in the data.
> summary(mydata$CompRes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.600 4.600 5.450 5.442 6.325 7.800
Advertising – The results for this variable also do not show any outliers.
> summary(mydata$Advertising)
ProdLine – This variable also does not show any outliers.
> summary(mydata$ProdLine)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 4.700 5.750 5.805 6.800 8.400
SalesFImage – The results for this variable, and the resulting box plot, show the presence of outliers in the data.
> summary(mydata$SalesFImage)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.900 4.500 4.900 5.123 5.800 8.200
ComPricing – The summary results for this variable are shown below.
> summary(mydata$ComPricing)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.700 5.875 7.100 6.974 8.400 9.900
WartyClaim – This variable also does not have any outliers.
> summary(mydata$WartyClaim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.100 5.400 6.100 6.043 6.600 8.100
OrdBilling – The results for this variable show the presence of outliers in the data.
> summary(mydata$OrdBilling)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.700 4.400 4.278 4.800 6.700
DelSpeed – The results for this variable likewise show the presence of outliers in the data.
> summary(mydata$DelSpeed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.600 3.400 3.900 3.886 4.425 5.500
Satisfaction – The results for this variable do not show any outliers.
> summary(mydata$Satisfaction)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.700 6.000 7.050 6.918 7.625 9.900
Now, we plot and examine the correlations among the variables. The correlation matrix is shown below:
4) OrdBilling and DelSpeed are highly correlated.
5) Ecom and SalesFImage are highly correlated.
As expected, the correlation between Sales Force Image and E-Commerce is high, as are the correlations of Delivery Speed and Order & Billing with Complaint Resolution, and the correlation between Order & Billing and Delivery Speed. Thus, we can safely conclude that a high degree of collinearity exists among the independent variables.
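The highly correlated pairs can also be listed programmatically from the correlation matrix. A sketch on a small simulated stand-in (the 0.7 threshold is an assumption, not from the report; on the project data the same idea would apply to `cor(mydata)`):

```r
# List variable pairs whose absolute correlation exceeds a threshold.
set.seed(1)
demo   <- data.frame(a = rnorm(50))
demo$b <- demo$a + rnorm(50, sd = 0.3)   # strongly correlated with a
demo$c <- rnorm(50)                      # independent of the others

cm   <- cor(demo)
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r    = round(cm[high], 2))    # here: the pair (a, b)
```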
Factor Analysis
Before getting into factor analysis, let’s first build an initial regression model using all the
independent variables and check the performance of the model.
Call:
lm(formula = Satisfaction ~ ., data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.43005 -0.31165 0.07621 0.37190 0.90120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.66961 0.81233 -0.824 0.41199
ProdQual 0.37137 0.05177 7.173 2.18e-10 ***
Ecom -0.44056 0.13396 -3.289 0.00145 **
TechSup 0.03299 0.06372 0.518 0.60591
CompRes 0.16703 0.10173 1.642 0.10416
Advertising -0.02602 0.06161 -0.422 0.67382
ProdLine 0.14034 0.08025 1.749 0.08384 .
SalesFImage 0.80611 0.09775 8.247 1.45e-12 ***
ComPricing -0.03853 0.04677 -0.824 0.41235
WartyClaim -0.10298 0.12330 -0.835 0.40587
OrdBilling 0.14635 0.10367 1.412 0.16160
DelSpeed 0.16570 0.19644 0.844 0.40124
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, even though the independent variables explain 78% of the variance in the dependent variable, only 3 out of the 11 independent variables are significant.
Remedial Measures
The two most widely used methods to deal with multicollinearity in a model are the following:
1) Remove some of the highly correlated variables using VIF (you will learn about this
technique in the Predictive Modelling course) or stepwise algorithms.
2) Perform an analysis design like principal component analysis (PCA) / Factor Analysis on
the correlated variables.
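A sketch of remedy (1): compute each predictor's Variance Inflation Factor by hand as VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. Shown on simulated stand-in columns; the cutoff of 5 is a common convention, not taken from the report.

```r
# Hand-rolled VIF: regress each predictor on the others.
set.seed(7)
X    <- data.frame(x1 = rnorm(80))
X$x2 <- X$x1 + rnorm(80, sd = 0.1)   # nearly collinear pair
X$x3 <- rnorm(80)                    # independent predictor

vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 1)          # x1 and x2 sit far above a cutoff of 5, x3 does not
names(vif)[vif > 5]    # candidates for removal
```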
Factor Analysis
Now let’s check for the factorability of the variables in the dataset.
Now, we will use the fa.parallel function from Psych package in R to execute parallel analysis to
find acceptable number of factors and generate the scree plot.
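The core of parallel analysis can be sketched in base R on simulated data (the report itself calls psych::fa.parallel on mydata): compare the eigenvalues of the actual correlation matrix against those of random data of the same shape, and retain the factors whose eigenvalues come out larger.

```r
# Parallel-analysis sketch: eigenvalues of real vs. random data.
set.seed(3)
n <- 100; p <- 11
f <- rnorm(n)                                  # one strong common factor
X <- sapply(1:p, function(j) f * runif(1, 0.3, 0.9) + rnorm(n))

actual <- eigen(cor(X))$values
random <- rowMeans(replicate(50,
            eigen(cor(matrix(rnorm(n * p), n)))$values))
sum(actual > random)    # suggested number of factors/components to retain
```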
The blue line in the scree plot above shows the eigenvalues of the actual data, and the two red lines (plotted on top of each other, so they look like a single line) show the simulated and resampled data. Here, we look for the large drops in the actual data and spot the point where the curve levels off to the right.

Looking at the plot, we can conclude that 3 or 4 factors would be a good choice for further regression modelling.
Now, we will use four factors to perform the factor analysis. We will also use an orthogonal rotation (VARIMAX), since with an orthogonal rotation the rotated factors remain uncorrelated, whereas with an oblique rotation the resulting factors would be correlated.
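A self-contained sketch of this step, using base R's factanal with a VARIMAX rotation on simulated data with a planted four-factor structure (the report itself works with the psych package on mydata, so the names and loadings here are illustrative only):

```r
# Four-factor model with orthogonal (varimax) rotation.
set.seed(5)
n  <- 100
F4 <- matrix(rnorm(n * 4), n)          # four latent factors
L  <- matrix(0, 11, 4)                 # block loading pattern
L[1:3, 1] <- 0.8; L[4:6, 2] <- 0.8; L[7:9, 3] <- 0.8; L[10:11, 4] <- 0.8
X  <- F4 %*% t(L) + matrix(rnorm(n * 11, sd = 0.4), n)
colnames(X) <- paste0("V", 1:11)

fa4 <- factanal(X, factors = 4, rotation = "varimax", scores = "regression")
print(fa4$loadings, cutoff = 0.3)  # small loadings suppressed for readability
head(fa4$scores)                   # factor scores, used next as regressors
```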
WartyClaim 0.10 0.06 0.89 0.13 0.81 0.186 1.1
OrdBilling 0.77 0.13 0.09 0.09 0.62 0.378 1.1
DelSpeed 0.95 0.19 0.00 0.09 0.94 0.058 1.1
The degrees of freedom for the null model are 55 and the objective function was 6.55 with
Chi Square of 619.27
The degrees of freedom for the model are 17 and the objective function was 0.33
The harmonic number of observations is 100 with the empirical chi square 3.19 with prob < 1
The total number of observations was 100 with Likelihood Chi Square = 30.27 with prob <
0.024
Naming and interpreting the factors:
Regression Analysis using the Factor Scores as the Independent Variables
Now we will combine the dependent variable and the factor scores into a dataset and label them. Then we run the multiple linear regression model with Satisfaction as the dependent variable and the factors (obtained from the factor analysis) as the independent variables.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.88600 0.03409 114.000 < 2e-16 ***
Purchase 0.17539 0.03286 5.338 6.38e-07 ***
Marketing 0.15848 0.03764 4.210 5.79e-05 ***
Post_purchase 0.64218 0.03670 17.500 < 2e-16 ***
Prod_positioning 0.00589 0.03637 0.162 0.872
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
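This step can be sketched end-to-end with stand-in scores (the column names follow the report's factor labels, but the scores and coefficients here are simulated, not the project's):

```r
# Regress Satisfaction on four (simulated stand-in) factor scores.
set.seed(9)
scores <- matrix(rnorm(100 * 4), 100,
                 dimnames = list(NULL, c("Purchase", "Marketing",
                                         "Post_purchase", "Prod_positioning")))
satisfaction <- 4 + 0.6 * scores[, "Post_purchase"] + rnorm(100, sd = 0.5)

reg_data <- data.frame(Satisfaction = satisfaction, scores)
fit_fa   <- lm(Satisfaction ~ ., data = reg_data)
summary(fit_fa)   # in this simulation, Post_purchase is the dominant factor
```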
Conclusion
Thus, we can safely conclude that the multiple regression model built on the factor scores explains almost 65% of the variation in the dependent variable, and upon checking the model's validity we can conclude that it is fit for use.

Based on the model, variables like Complaint Resolution, Order & Billing, and Delivery Speed capture the efficiency of the service provided, which helps drive the purchase decision; E-Commerce, Sales Force Image, and Advertising capture the brand value of the service, reflecting the marketing inputs used by the service provider; and Product Quality, Product Line, and Competitive Pricing capture how the product or service has been positioned in the market.

Last but not least, the majority of the variation in the dependent variable "SATISFACTION" can be attributed to the factors mentioned above, so the actionable insight for the service provider is to focus on these aspects to ensure high levels of "CUSTOMER SATISFACTION".