Project Objective
The project is based on multivariate analysis and its application to market segmentation in the context of product and service management. The objective is to check the given dataset and its variables for multicollinearity, perform factor analysis, and then build a multiple regression model to predict Customer Satisfaction based on the four factors chosen during factor analysis.
What is multicollinearity, and how does it affect a regression model? Multicollinearity occurs when the independent variables of a regression model are correlated with one another. If the degree of collinearity between the independent variables is high, it becomes difficult to estimate the relationship between each independent variable and the dependent variable, and the overall precision of the estimated coefficients suffers. A regression model with high multicollinearity can give a high R-squared yet hardly any significant variables.
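A short simulated sketch of this effect (the variables below are illustrative stand-ins, not from the project's dataset): two nearly identical predictors leave the model's R-squared high while making the individual coefficients hard to pin down.

```r
# Simulated demonstration of multicollinearity: x2 is almost a copy of x1.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)          # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)             # y really depends only on x1

fit <- lm(y ~ x1 + x2)
summary(fit)$r.squared                  # overall fit stays high
summary(fit)$coefficients[, "t value"]  # x1 and x2 individually look weak
```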
Defining Business Problem
Customer Satisfaction has been a topic of great interest to organizations across different industries. One of the most important objectives of any organization is to increase profits and reduce costs. The most important factor in achieving both is Customer Satisfaction, because it leads to customer loyalty, repeat purchases, and recommendations. Moreover, Customer Satisfaction and Service Quality have proven to be effective tools for improving the overall performance of an organization.
So, what is satisfaction? A customer who is content with a product or service is said to be satisfied. Satisfaction can also be described as a person's feeling of pleasure or disappointment resulting from comparing a product's or service's perceived performance with their expectations.
Assumptions
Data Dictionary
The target, or dependent, variable in the given dataset is "Customer Satisfaction", with values between 1 and 10, where 1 is the lowest level of Customer Satisfaction and 10 the highest. The dataset has 100 records with 11 different independent variables. The variables are listed below:
Data Summary and Exploratory Data Analysis
ProdQual – We check the summary of this variable; after reviewing the results, we can conclude that there are no outliers in the data. This is further substantiated by the box plot.
> summary(mydata$ProdQual)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.000 6.575 8.000 7.810 9.100 10.000
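The box-plot check used throughout this section follows the standard 1.5×IQR rule. A minimal sketch of that rule (shown on a stand-in vector in ProdQual's range so the snippet runs on its own) could look like this:

```r
# Flag values outside the box-plot fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
check_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x[x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr]
}

x <- c(5.0, 6.6, 8.0, 9.1, 10.0, 7.8)  # stand-in values in ProdQual's range
check_outliers(x)           # numeric(0): nothing flagged
check_outliers(c(x, 25))    # 25 falls outside the upper fence
```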
Ecom – We follow the same steps for this variable, checking the summary and looking for outliers. The results show that this variable has a few outliers, but not enough to warrant outlier treatment.
> summary(mydata$Ecom)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 3.275 3.600 3.672 3.925 5.700
TechSup – We review the summary and outlier results for this variable and find that it has no outliers.
> summary(mydata$TechSup)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.300 4.250 5.400 5.365 6.625 8.500
CompRes – The summary results for this variable also do not show any outliers in the data.
> summary(mydata$CompRes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.600 4.600 5.450 5.442 6.325 7.800
Advertising – The results for this variable also do not show any outliers.
> summary(mydata$Advertising)
ProdLine – This variable also does not show any outliers.
> summary(mydata$ProdLine)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 4.700 5.750 5.805 6.800 8.400
SalesFImage – The results for this variable, and the resulting box plot, show the presence of outliers in the data.
> summary(mydata$SalesFImage)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.900 4.500 4.900 5.123 5.800 8.200
ComPricing – The summary results for this variable are shown below.
> summary(mydata$ComPricing)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.700 5.875 7.100 6.974 8.400 9.900
WartyClaim – This variable also does not have any outliers.
> summary(mydata$WartyClaim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.100 5.400 6.100 6.043 6.600 8.100
OrdBilling – The results for this variable show the presence of outliers in the data.
> summary(mydata$OrdBilling)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.700 4.400 4.278 4.800 6.700
DelSpeed – The results for this variable likewise show the presence of outliers in the data.
> summary(mydata$DelSpeed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.600 3.400 3.900 3.886 4.425 5.500
Satisfaction – The results for this variable do not show any outliers.
> summary(mydata$Satisfaction)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.700 6.000 7.050 6.918 7.625 9.900
Now, we plot and examine the correlations among the variables. The correlation matrix is shown below:
4) OrdBilling and DelSpeed are highly correlated.
5) Ecom and SalesFImage are highly correlated.
As expected, the correlation between Sales Force Image and E-Commerce is high, as are the correlations of Delivery Speed and Order & Billing with Complaint Resolution, and the correlation between Order & Billing and Delivery Speed. Thus, we can safely conclude that a high degree of collinearity exists among the independent variables.
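The highly correlated pairs can also be listed programmatically from the correlation matrix. A sketch on a small simulated stand-in (the 0.7 threshold is an assumption, not from the report; on the project data the same idea would apply to `cor(mydata)`):

```r
# List variable pairs whose absolute correlation exceeds a threshold.
set.seed(1)
demo   <- data.frame(a = rnorm(50))
demo$b <- demo$a + rnorm(50, sd = 0.3)   # strongly correlated with a
demo$c <- rnorm(50)                      # independent of the others

cm   <- cor(demo)
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r    = round(cm[high], 2))    # here: the pair (a, b)
```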
Factor Analysis
Before getting into factor analysis, let’s first build an initial regression model using all the
independent variables and check the performance of the model.
Call:
lm(formula = Satisfaction ~ ., data = mydata)
Residuals:
Min 1Q Median 3Q Max
-1.43005 -0.31165 0.07621 0.37190 0.90120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.66961 0.81233 -0.824 0.41199
ProdQual 0.37137 0.05177 7.173 2.18e-10 ***
Ecom -0.44056 0.13396 -3.289 0.00145 **
TechSup 0.03299 0.06372 0.518 0.60591
CompRes 0.16703 0.10173 1.642 0.10416
Advertising -0.02602 0.06161 -0.422 0.67382
ProdLine 0.14034 0.08025 1.749 0.08384 .
SalesFImage 0.80611 0.09775 8.247 1.45e-12 ***
ComPricing -0.03853 0.04677 -0.824 0.41235
WartyClaim -0.10298 0.12330 -0.835 0.40587
OrdBilling 0.14635 0.10367 1.412 0.16160
DelSpeed 0.16570 0.19644 0.844 0.40124
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, even though the independent variables explain 78% of the variance in the dependent variable, only 3 out of the 11 independent variables are significant.
Remedial Measures
The two most widely used methods to deal with multicollinearity in a model are the following:
1) Remove some of the highly correlated variables using VIF (you will learn about this
technique in the Predictive Modelling course) or stepwise algorithms.
2) Perform an analysis design like principal component analysis (PCA) / Factor Analysis on
the correlated variables.
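A sketch of remedy (1): compute each predictor's Variance Inflation Factor by hand as VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. Shown on simulated stand-in columns; the cutoff of 5 is a common convention, not taken from the report.

```r
# Hand-rolled VIF: regress each predictor on the others.
set.seed(7)
X    <- data.frame(x1 = rnorm(80))
X$x2 <- X$x1 + rnorm(80, sd = 0.1)   # nearly collinear pair
X$x3 <- rnorm(80)                    # independent predictor

vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 1)          # x1 and x2 sit far above a cutoff of 5, x3 does not
names(vif)[vif > 5]    # candidates for removal
```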
Factor Analysis
Now let’s check for the factorability of the variables in the dataset.
Now, we will use the fa.parallel function from Psych package in R to execute parallel analysis to
find acceptable number of factors and generate the scree plot.
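The core of parallel analysis can be sketched in base R on simulated data (the report itself calls psych::fa.parallel on mydata): compare the eigenvalues of the actual correlation matrix against those of random data of the same shape, and retain the factors whose eigenvalues come out larger.

```r
# Parallel-analysis sketch: eigenvalues of real vs. random data.
set.seed(3)
n <- 100; p <- 11
f <- rnorm(n)                                  # one strong common factor
X <- sapply(1:p, function(j) f * runif(1, 0.3, 0.9) + rnorm(n))

actual <- eigen(cor(X))$values
random <- rowMeans(replicate(50,
            eigen(cor(matrix(rnorm(n * p), n)))$values))
sum(actual > random)    # suggested number of factors/components to retain
```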
The blue line in the scree plot above shows the eigenvalues of the actual data, and the two red lines (plotted on top of each other, so they look like a single line) show the simulated and resampled data. Here, we look for the large drops in the actual data and spot the point where the curve levels off to the right.

Looking at the plot, we can conclude that 3 or 4 factors would be a good choice for further regression modelling.
Now, we will use four factors to perform the factor analysis. We will also use an orthogonal rotation (VARIMAX), since with an orthogonal rotation the rotated factors remain uncorrelated, whereas with an oblique rotation the resulting factors would be correlated.
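A self-contained sketch of this step, using base R's factanal with a VARIMAX rotation on simulated data with a planted four-factor structure (the report itself works with the psych package on mydata, so the names and loadings here are illustrative only):

```r
# Four-factor model with orthogonal (varimax) rotation.
set.seed(5)
n  <- 100
F4 <- matrix(rnorm(n * 4), n)          # four latent factors
L  <- matrix(0, 11, 4)                 # block loading pattern
L[1:3, 1] <- 0.8; L[4:6, 2] <- 0.8; L[7:9, 3] <- 0.8; L[10:11, 4] <- 0.8
X  <- F4 %*% t(L) + matrix(rnorm(n * 11, sd = 0.4), n)
colnames(X) <- paste0("V", 1:11)

fa4 <- factanal(X, factors = 4, rotation = "varimax", scores = "regression")
print(fa4$loadings, cutoff = 0.3)  # small loadings suppressed for readability
head(fa4$scores)                   # factor scores, used next as regressors
```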
WartyClaim 0.10 0.06 0.89 0.13 0.81 0.186 1.1
OrdBilling 0.77 0.13 0.09 0.09 0.62 0.378 1.1
DelSpeed 0.95 0.19 0.00 0.09 0.94 0.058 1.1
The degrees of freedom for the null model are 55 and the objective function was 6.55 with
Chi Square of 619.27
The degrees of freedom for the model are 17 and the objective function was 0.33
The harmonic number of observations is 100 with the empirical chi square 3.19 with prob < 1
The total number of observations was 100 with Likelihood Chi Square = 30.27 with prob <
0.024
Naming and interpreting the factors:
Regression Analysis using the Factor Scores as the Independent Variables
Now we will combine the dependent variable and the factor scores into a dataset and label them. Then we run the multiple linear regression model with Satisfaction as the dependent variable and the factors (obtained from the factor analysis) as the independent variables.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.88600 0.03409 114.000 < 2e-16 ***
Purchase 0.17539 0.03286 5.338 6.38e-07 ***
Marketing 0.15848 0.03764 4.210 5.79e-05 ***
Post_purchase 0.64218 0.03670 17.500 < 2e-16 ***
Prod_positioning 0.00589 0.03637 0.162 0.872
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
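This step can be sketched end-to-end with stand-in scores (the column names follow the report's factor labels, but the scores and coefficients here are simulated, not the project's):

```r
# Regress Satisfaction on four (simulated stand-in) factor scores.
set.seed(9)
scores <- matrix(rnorm(100 * 4), 100,
                 dimnames = list(NULL, c("Purchase", "Marketing",
                                         "Post_purchase", "Prod_positioning")))
satisfaction <- 4 + 0.6 * scores[, "Post_purchase"] + rnorm(100, sd = 0.5)

reg_data <- data.frame(Satisfaction = satisfaction, scores)
fit_fa   <- lm(Satisfaction ~ ., data = reg_data)
summary(fit_fa)   # in this simulation, Post_purchase is the dominant factor
```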
Conclusion
Thus, we can safely conclude that the multiple regression model built on the factor scores explains almost 65% of the variation in the dependent variable, and upon checking the model's validity we can conclude that it is fit for use.

Based on the model, variables like Complaint Resolution, Order & Billing, and Delivery Speed capture the efficiency of the service provided, which helps drive the purchase decision; E-Commerce, Sales Force Image, and Advertising capture the brand value of the service, reflecting the marketing inputs used by the service provider; and Product Quality, Product Line, and Competitive Pricing capture how the product or service has been positioned in the market.

Last but not least, the majority of the variation in the dependent variable "SATISFACTION" can be attributed to the factors mentioned above, so the actionable insight for the service provider is to focus on these aspects to ensure high levels of "CUSTOMER SATISFACTION".