Suraj Ramkumar
Rahul Godbole
Harshvardhan Kadam
Ankit Popat
Ashish Srivastava
05/01/2020
TABLE OF CONTENTS
Introduction .............................................................................................................................................................................................. 3
a. Basic data summary, Univariate, Bivariate analysis, graphs, outliers, missing values ................................... 3
a. Bagging ........................................................................................................................................................................................... 37
Introduction
The dataset provided as part of this assignment consists of personal details of employees and their
preferred mode of transport. We need to analyze the data using machine learning models and predict
whether or not an employee will use a car as their mode of transport. We also need to determine
which variables are significant predictors of this decision.
## [1] 444 9
There are 444 rows and 9 columns provided as part of the dataset.
i. Structure of Data
str(cars)
The columns ‘Engineer’, ‘MBA’ and ‘license’ need to be converted to factor variables. This is done below:
cars$Engineer <- as.factor(as.character(cars$Engineer))
cars$Engineer <- factor(cars$Engineer, labels = c("Non-Engineer","Engineer"))
cars$MBA <- as.factor(as.character(cars$MBA))
cars$MBA <- factor(cars$MBA, labels = c("Non-MBA","MBA"))
cars$license <- as.factor(as.character(cars$license))
cars$license <- factor(cars$license, labels = c("No license","Licensed"))
str(cars)
## Missing_Values
## Age 0
## Gender 0
## Engineer 0
## MBA 1
## Work.Exp 0
## Salary 0
## Distance 0
## license 0
## Transport 0
We observe there is a missing value in the column ‘MBA’, i.e. row 145. We can either impute it with
the majority class (0, i.e. Non-MBA) or impute it using an algorithm such as KNN imputation.
Let us impute this missing value using KNN imputation algorithm.
library(DMwR)
cars<-knnImputation(cars)
cars[145,]
We have now imputed the missing data for this row (imputed value is 0).
iv. The Initial Hypothesis
Null Hypothesis (H0): No predictor is able to predict the mode of transport.
Alternate Hypothesis (Ha): At least one of the predictors is able to predict the mode of transport.
##
## No Yes
## 383 61
There are 25 outliers for the variable ‘Age’. These outliers are:
cars$Age[which(cars$Age %in% OutVals_age)]
## [1] 39 39 39 38 40 38 38 38 38 40 40 39 40 38 39 38 40 39 38 42 40 43 40
## [24] 38 39
Among the values flagged as outliers for ‘Age’, none appear to be typos or impossible values.
Hence, we will not treat these outliers.
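The vector OutVals_age used above is not defined in the excerpt shown; assuming the usual boxplot (1.5 × IQR) rule, it would come from boxplot.stats(). A minimal sketch, demonstrated on a small synthetic vector since the full dataset is not reproduced here:

```r
# boxplot.stats() returns, in $out, the points beyond 1.5 * IQR of the quartiles;
# OutVals_age would be obtained as boxplot.stats(cars$Age)$out.
# Demonstration on a synthetic vector:
age_demo <- c(rep(28, 10), rep(30, 10), 43)
OutVals_demo <- boxplot.stats(age_demo)$out
OutVals_demo  # 43 is flagged as an outlier
```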
3) Gender
summary(cars$Gender)
## Female Male
## 128 316
gender_ratio <- round(sum(cars$Gender=="Male")/sum(cars$Gender=="Female"),2)
##
## Non-Engineer Engineer
## 109 335
##
## Non-MBA MBA
## 332 112
mba_ratio <- round(sum(cars$MBA=="MBA")/sum(cars$MBA=="Non-MBA"),2)
There are 38 outliers identified for the variable ‘Work Exp’. These outliers are:
cars$Work.Exp[which(cars$Work.Exp %in% OutVals_work.ex)]
## [1] 19 16 21 17 16 18 19 18 21 16 19 19 18 19 20 22 16 20 18 21 20 20 16
## [24] 17 21 18 20 21 19 22 22 19 24 20 19 19 19 21
Among the values flagged as outliers for ‘Work Exp’, none appear to be typos or impossible values.
Hence, we will not treat these outliers.
7) Salary
summary(cars$Salary)
There are 59 outliers identified for the variable ‘Salary’. These are:
cars$Salary[which(cars$Salary %in% OutVals_sal)]
## [1] 36.6 38.9 25.9 34.8 28.8 39.9 39.0 28.7 36.9 28.7 34.9 47.0 28.8 36.9
## [15] 54.0 29.9 34.9 36.0 44.0 37.0 24.9 43.0 37.0 54.0 44.0 34.0 48.0 42.0
## [29] 51.0 45.0 34.0 28.8 45.0 42.9 41.0 40.9 30.9 41.9 43.0 33.0 36.0 33.0
## [43] 38.0 46.0 45.0 48.0 35.0 51.0 51.0 55.0 45.0 42.0 52.0 38.0 57.0 44.0
## [57] 45.0 47.0 50.0
Among the values flagged as outliers for ‘Salary’, none appear to be typos or impossible values.
Hence, we will not treat these outliers.
8) Distance
summary(cars$Distance)
There are 9 outliers identified for the variable ‘Distance’. These are:
cars$Distance[which(cars$Distance %in% OutVal_dist)]
## [1] 20.7 20.8 21.0 21.3 21.4 21.5 21.5 22.8 23.4
Among the values flagged as outliers for ‘Distance’, none appear to be typos or impossible values.
Hence, we will not treat these outliers.
9) License
table(cars$license)
##
## No license Licensed
## 340 104
vi. Bivariate Analysis
Let us now perform bivariate analysis, i.e. examine ‘Transport_car’ against each of the other variables.
1) Age and Transport_Car
boxplot(cars$Age~cars$Transport_car, col = "cyan", main = "Boxplot of Age and Transport_car", xlab = "Transport_car", ylab = "Age")
Here, we observe that the average age of employees travelling by car is higher than that of
employees who do not travel by car.
2) Gender and Transport_Car
# Crosstab
gender.car<-table(cars$Gender,cars$Transport_car)
gender.car.totals<-addmargins(gender.car)
gender.car.totals
##
## No Yes Sum
## Female 115 13 128
## Male 268 48 316
## Sum 383 61 444
# sided barplot
barplot(gender.car, beside = T, xlab = "Car Transport", ylab = "Count", legend = rownames(gender.car), args.legend = list(x = "topright"), col = c("pink","cyan"), main = "Car Transport by Gender")
Observations:
1) Car transport within Gender: For both males and females, the proportion travelling by car is
significantly smaller than the proportion who do not.
2) Car transport by Gender: More males travel by car than females.
##
## No Yes Sum
## Non-Engineer 100 9 109
## Engineer 283 52 335
## Sum 383 61 444
# sided barplot
barplot(eng.car, beside = T, xlab = "Car Transport", ylab = "Count", legend = rownames(eng.car), args.legend = list(x = "topright"), col = c("light green","cyan"), main = "Car Transport by Engineer")
Observations:
1) Car transport within Engineers: The majority of Engineers do not travel by car.
2) Car transport by Engineers v/s non-Engineers: More Engineers travel by car than non-Engineers.
##
## No Yes Sum
## Non-MBA 283 49 332
## MBA 100 12 112
## Sum 383 61 444
# sided barplot
barplot(mba.car, beside = T, xlab = "Car Transport", ylab = "Count", legend = rownames(mba.car), args.legend = list(x = "topright"), col = c("light green","cyan"), main = "Car Transport by MBAs")
Observations:
1) Car transport within MBAs: The majority of MBAs do not travel by car.
2) Car transport by MBAs v/s non-MBAs: More non-MBAs travel by car than MBAs.
Here, we observe that, on average, employees with more years of work experience tend to travel by car
compared to those with fewer years of work experience.
6) Salary and Transport_Car
boxplot(cars$Salary~cars$Transport_car, col = "cyan", main = "Boxplot of Salary and Transport_car", xlab = "Transport_car", ylab = "Salary")
Here, we observe that, on average, employees with higher salaries tend to travel by car compared to
those with lower salaries. However, a few employees with moderately high salaries do not travel by
car (captured as outliers in the Transport_car ‘No’ group).
7) Distance and Transport_Car
boxplot(cars$Distance~cars$Transport_car, col = "cyan", main = "Boxplot of Distance and Transport_car", xlab = "Transport_car", ylab = "Distance")
Here, we observe that, on average, employees staying closer to the office (shorter distance) prefer
not to travel by car, compared to those staying farther away (longer distance from office).
8) License and Transport_Car
# Crosstab
license.car<-table(cars$license,cars$Transport_car)
license.car.totals<-addmargins(license.car)
license.car.totals
##
## No Yes Sum
## No license 327 13 340
## Licensed 56 48 104
## Sum 383 61 444
# sided barplot
barplot(license.car, beside = T, xlab = "Car Transport", ylab = "Count", legend = rownames(license.car), args.legend = list(x = "topright"), col = c("light green","cyan"), main = "Car Transport and License")
Observations:
Almost equal numbers of licensed employees travel and do not travel by car.
Note: In the figure above, some employees without a license appear to travel by car. This
interpretation is misleading: these are employees who hold a license for a 2-wheeler rather than a
car (recall that we created a separate column for car transport), or they are accompanying
passengers in the car.
b. Check for Multicollinearity
Let us create a correlation plot for the given dataset and understand which variables are highly
correlated.
library(corrplot)
cars.num <- cars[-9]
cars.num <- sapply(cars.num, as.numeric)
cars.cor <- cor(cars.num)
corrplot.mixed(cars.cor, main = "Correlation plot")
1) The variables Age, Work.Exp and Salary are strongly correlated with one another.
2) The variables Distance and license have comparatively weaker correlations with the other variables.
3) The variables Gender, Engineer and MBA show negligible correlation with the other variables.
c. Interpretation of Business problem and observations
In the given scenario, let us assume we are an automobile company who is looking for strategies to
improve sales and the objective of this exercise is to target employees of an organization to promote
automobile sales (assuming employer has obtained prior approval from employees to share their
personal details with the automobile company).
Business Problem: To increase automobile sales (while optimizing marketing/sales costs) by
targeting employees of organizations based on their personal, education and work related
characteristics.
The key observations (summary) based on exploratory analysis are as follows:
In the subsequent sections, we will create a predictive model based on logistic regression and machine
learning models to understand the odds of employees travelling by car based on different variables.
2) Data Preparation
In section 1, we have already performed a few data preparation steps, i.e.
1) Conversion of data types to relevant formats (e.g. numeric to factor)
cars_train<-cars[pd==1,]
cars_test<-cars[pd==2,]
nrow(cars_train)/nrow(cars)
## [1] 0.6891892
nrow(cars_test)/nrow(cars)
## [1] 0.3108108
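The partition vector pd used above is not shown in the excerpt; a common way to create it, assuming a random ~70/30 split (the seed below is a placeholder, not the report's actual seed):

```r
set.seed(123)  # placeholder seed; the report's actual seed is not shown
pd <- sample(1:2, 444, replace = TRUE, prob = c(0.7, 0.3))  # 444 = nrow(cars)
prop.table(table(pd))  # roughly 0.70 / 0.30
```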
We will also check the target variable response rate in both the train and test data.
rr.train<-sum(cars_train$Transport_car == "Yes")/nrow(cars_train)
rr.test<-sum(cars_test$Transport_car == "Yes")/nrow(cars_test)
##
## No Yes
## 269 37
## [1] 314
table(smoted.cars_train$Transport_car)
##
## No Yes
## 166 148
prop.table(table(smoted.cars_train$Transport_car))
##
## No Yes
## 0.5286624 0.4713376
After applying SMOTE, we have now balanced the target variable responder class in the training data.
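The SMOTE call itself is not shown in the excerpt; with the DMwR package (already loaded for imputation) it would follow this pattern. The perc.over and perc.under values below are illustrative placeholders, not the report's actual settings:

```r
library(DMwR)  # provides SMOTE()
# perc.over / perc.under are placeholders; the report's settings are not shown
smoted.cars_train <- SMOTE(Transport_car ~ ., data = cars_train,
                           perc.over = 300, perc.under = 150)
```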
3) Prediction Models
a. Logistic Regression
Let us now apply logistic regression to the training dataset considering all the variables.
i. Train Model 1
#?glm()
m1<-glm(Transport_car ~.,data=smoted.cars_train, family="binomial")
summary(m1)
##
## Call:
## glm(formula = Transport_car ~ ., family = "binomial", data = smoted.cars_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.82195 -0.01496 -0.00003 0.02435 1.56132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -88.62249 18.30654 -4.841 1.29e-06 ***
## Age 2.93564 0.61732 4.755 1.98e-06 ***
## GenderMale -0.15685 0.77881 -0.201 0.84039
## EngineerEngineer 0.07116 1.14323 0.062 0.95037
## MBAMBA -1.51164 0.80189 -1.885 0.05942 .
## Work.Exp -1.54290 0.38321 -4.026 5.67e-05 ***
## Salary 0.25046 0.07626 3.284 0.00102 **
## Distance 0.48587 0.17777 2.733 0.00627 **
## licenseLicensed 1.04909 0.72986 1.437 0.15061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.264 on 313 degrees of freedom
## Residual deviance: 61.917 on 305 degrees of freedom
## AIC: 79.917
##
## Number of Fisher Scoring iterations: 9
Interpretation:
The statistically significant variables according to this model are:
1) Age
2) Work Exp
3) Salary
4) Distance
The AIC value of this model is 79.917.
Null deviance (434.26) > Residual deviance (61.92), implying the predictors add substantial
explanatory power over the intercept-only model.
VIF: Model 1
library(car)
vif(m1)
The VIF values of Age, Work.Exp and Salary are high, implying the presence of multicollinearity in
the model. Let us remove one of these variables and run the model again.
ii. Train Model 2
# Remove Work Exp variable
smoted.cars_train2 <- smoted.cars_train[,-5]
# Logistic Model
m2<-glm(Transport_car ~.,data=smoted.cars_train2, family="binomial")
summary(m2)
##
## Call:
## glm(formula = Transport_car ~ ., family = "binomial", data = smoted.cars_train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93780 -0.08114 -0.00327 0.04801 2.42137
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.10441 6.77944 -5.768 8.02e-09 ***
## Age 1.12086 0.21381 5.242 1.58e-07 ***
## GenderMale -0.98921 0.65216 -1.517 0.12931
## EngineerEngineer 0.25594 0.89157 0.287 0.77406
## MBAMBA -1.61003 0.68234 -2.360 0.01830 *
## Salary 0.01248 0.04275 0.292 0.77036
## Distance 0.34922 0.11664 2.994 0.00275 **
## licenseLicensed 1.51176 0.62548 2.417 0.01565 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.264 on 313 degrees of freedom
## Residual deviance: 87.298 on 306 degrees of freedom
## AIC: 103.3
##
## Number of Fisher Scoring iterations: 8
The statistically significant variables according to this model are:
1) Age
2) MBA
3) Distance
4) License
The AIC value is 103.3, and Null deviance (434.26) > Residual deviance (87.30), again implying the
model adds explanatory power over the intercept-only model.
VIF: Model 2
vif(m2)
The VIF for all variables is acceptable. Let us proceed with this model for further validations.
iii. Log Regression Model 2: Model significance verification
1) Log Likelihood ratio test
library(lmtest)
lrtest(m2)
Interpretation
H0: All betas are zero
H1: Atleast 1 beta is non-zero
From the log-likelihoods, the intercept-only model has log-likelihood -217.132, while model m2 has
-43.649. Hence 1 – (-43.649 / -217.132) = 79.9% of the uncertainty inherent in the intercept-only
model is explained by model m2.
The Chi-square likelihood ratio statistic is significant; the p-value lets us reject the null
hypothesis in favour of the alternate hypothesis that at least one beta is non-zero. The model is
therefore significant.
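The 79.9% figure can be checked directly from the quoted log-likelihoods:

```r
# Log-likelihoods quoted from the lrtest() output
ll_null <- -217.132  # intercept-only model
ll_full <- -43.649   # model m2
mcfadden <- 1 - ll_full / ll_null
round(mcfadden, 3)   # 0.799
```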
2) McFadden’s pseudo R Square test
library(pscl)
Pseudo_m2<-pR2(m2)
Pseudo_m2
## r2CU
## 0.8926938
Interpretation: Based on McFadden’s R-squared, we conclude that 79.9% of the uncertainty of the
intercept-only model has been explained by the full model (m2). Thus the goodness of fit is robust.
iv. Optimize LM using step() function
In this step, we will optimize the model m2 so as to consider the variables that result in the minimum
acceptable AIC.
m2 <- step(glm(Transport_car~.,data = smoted.cars_train2, family = "binomial"))
## Start: AIC=103.3
## Transport_car ~ Age + Gender + Engineer + MBA + Salary + Distance +
## license
##
## Df Deviance AIC
## - Engineer 1 87.380 101.38
## - Salary 1 87.384 101.38
## <none> 87.298 103.30
## - Gender 1 89.719 103.72
## - MBA 1 93.485 107.48
## - license 1 93.579 107.58
## - Distance 1 98.061 112.06
## - Age 1 154.710 168.71
##
## Step: AIC=101.38
## Transport_car ~ Age + Gender + MBA + Salary + Distance + license
##
## Df Deviance AIC
## - Salary 1 87.488 99.488
## <none> 87.380 101.380
## - Gender 1 89.763 101.763
## - license 1 93.626 105.626
## - MBA 1 93.946 105.946
## - Distance 1 98.065 110.065
## - Age 1 155.363 167.363
##
## Step: AIC=99.49
## Transport_car ~ Age + Gender + MBA + Distance + license
##
## Df Deviance AIC
## <none> 87.488 99.488
## - Gender 1 89.781 99.781
## - license 1 93.864 103.864
## - MBA 1 94.050 104.050
## - Distance 1 98.946 108.946
## - Age 1 252.255 262.255
summary(m2)
##
## Call:
## glm(formula = Transport_car ~ Age + Gender + MBA + Distance +
## license, family = "binomial", data = smoted.cars_train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.92044 -0.07972 -0.00299 0.04800 2.43207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.5960 6.4534 -6.136 0.000000000848 ***
## Age 1.1496 0.1992 5.770 0.000000007907 ***
## GenderMale -0.9488 0.6406 -1.481 0.13861
## MBAMBA -1.6461 0.6786 -2.426 0.01527 *
## Distance 0.3539 0.1151 3.075 0.00211 **
## licenseLicensed 1.4852 0.6105 2.433 0.01498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.264 on 313 degrees of freedom
## Residual deviance: 87.488 on 308 degrees of freedom
## AIC: 99.488
##
## Number of Fisher Scoring iterations: 8
pR2(m2)
Interpretation: Based on McFadden’s R-squared for the optimized model, we conclude that 79.85% of
the uncertainty of the intercept-only model has been explained by the optimized model. Thus the
goodness of fit is robust.
v. Odds Explanatory Power
Let’s find out the power of Odds and Probability of the variables impacting employee’s decision to
commute by car.
round(exp(coef(m2)),4) # Odds Ratio
round(exp(coef(m2))/(1+exp(coef(m2))),4) # Probability
# Odds Ratio
odds_m2=exp(coef(m2))
probability=odds_m2/(1+odds_m2)
sum_odds_m2 <- sum(odds_m2[-1]) # sum over predictor odds (exclude the intercept); the optimized model has only 6 coefficients
varImp<-odds_m2/sum_odds_m2*100
odds_mat<-data.frame(odds_m2,probability,varImp)
odds_mat$probability <- odds_mat$probability*100
options(scipen=999)
round(odds_mat, 2)
2) When the employee is an MBA, the odds of opting for car transport are 0.2 times the odds of a
non-MBA employee.
3) When the employee has a driving license, the odds of opting for car transport are 4.53 times the
odds of an employee without one.
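Recall how an odds value maps back to a probability, p = odds / (1 + odds); for example, for the license odds ratio above:

```r
odds_to_prob <- function(odds) odds / (1 + odds)
odds_to_prob(4.53)  # odds of 4.53 correspond to p ~ 0.82
```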
# convert to factor
smoted.cars_train2$class_m2<-factor(smoted.cars_train2$class_m2, labels = c("No","Yes"))
# convert to factor
cars_test$class_m2<-factor(cars_test$class_m2, labels = c("No","Yes"))
library(Deducer)
rocplot(m2)
#AUC for m2
pred_m2_train <- prediction(smoted.cars_train2$prob_m2,smoted.cars_train2$Transport_car)
perf_m2_train <- performance(pred_m2_train, "tpr", "fpr")
## [1] 98.81
# KS for m2
KS_m2_train <- round(max(attr(perf_m2_train, 'y.values')[[1]]-attr(perf_m2_train,
'x.values')[[1]]),4)*100
KS_m2_train
## [1] 91.05
# Gini for m2
gini_m2_train = round(ineq(smoted.cars_train2$prob_m2, type="Gini"),4)*100
gini_m2_train
## [1] 51.61
On Test Data
## Model 2
# KS, Gini and AUC for model m2
#library(ROCR)
#library(ineq)
#AUC for m2
pred_m2_test <- prediction(cars_test$prob_m2,cars_test$Transport_car)
perf_m2_test <- performance(pred_m2_test, "tpr", "fpr")
## [1] 99.52
# KS for m2
KS_m2_test <- round(max(attr(perf_m2_test, 'y.values')[[1]]-attr(perf_m2_test,
'x.values')[[1]]),4)*100
KS_m2_test
## [1] 94.96
# Gini for m2
gini_m2_test = round(ineq(cars_test$prob_m2, type="Gini"),4)*100
gini_m2_test
## [1] 79.4
The model validation metrics (KS, AUC, Gini) obtained imply that this is a robust logistic regression
model.
b. K-Nearest Neighbor (KNN) Model
Let us create Train and Test data to create KNN model. We will use the same train and test data as used
for logistic regression.
# Create Train and Test Data for KNN
smoted.cars_train.knn <- smoted.cars_train
cars_test.knn <- cars_test[,-c(10,11)]
# Test data
cars_test.knn$Gender <- as.numeric(cars_test.knn$Gender)
cars_test.knn$Engineer <- as.numeric(cars_test.knn$Engineer)
cars_test.knn$MBA <- as.numeric(cars_test.knn$MBA)
cars_test.knn$license <- as.numeric(cars_test.knn$license)
cars_test.knn$Transport_car <- as.numeric(cars_test.knn$Transport_car)
str(cars_test.knn)
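The knn() call itself is not shown in the excerpt; with the class package (bundled with standard R installations) the fit-and-predict step follows this pattern, illustrated here on the built-in iris data since the report's chosen k is not shown (k = 3 is a placeholder):

```r
library(class)  # provides knn()
set.seed(1)
idx  <- sample(nrow(iris), 100)  # simple train/test split for illustration
pred <- knn(train = scale(iris[idx, 1:4]),
            test  = scale(iris[-idx, 1:4]),
            cl    = iris$Species[idx],
            k     = 3)  # placeholder value of k
mean(pred == iris$Species[-idx])  # test-set accuracy
```

Note that scaling train and test separately, as above, is a simplification; strictly, the test set should be scaled using the training set's centering and scaling parameters.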
Interpretation
By using KNN, we have predicted the employees (on test data) who will be travelling by car with an
accuracy of over 90%, which seems acceptable.
c. Naive Bayes Model
i. Applicability of Naive Bayes on this dataset
Our question is a binary classification: will the employee travel by car (Yes/No)? In its original
form, the target variable (Transport) has 3 responder classes, i.e. ‘2Wheeler’, ‘Car’ and ‘Public
Transport’. Hence, it cannot be used directly in a Naive Bayes model to answer this question.
Instead, the column can be transformed. Since, we need to focus on prediction of employee transport
using Car only, let us create a new column (Boolean) for transport by car. Value will be 1 for Car and 0 for
either 2wheeler or Public Transport. This form can then be applied for Naive Bayes prediction model.
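The transformation described above (done in section 1.a) amounts to the following, demonstrated on a small vector since the full dataset is not reproduced here:

```r
# Derive the binary car-transport target from the 3-class Transport column
transport_demo <- c("2Wheeler", "Car", "Public Transport", "Car")
Transport_car  <- factor(ifelse(transport_demo == "Car", "Yes", "No"))
Transport_car  # No Yes No Yes
# In the report: cars$Transport_car <- factor(ifelse(cars$Transport == "Car", "Yes", "No"))
```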
This transformation has been done in section (1.a) of the assignment. We will now proceed with the
remaining steps of Naive Bayes for model building and prediction.
ii. Data for Naive Bayes
Let us create Train and Test data to create Naive Bayes model. We will use the same train and test data as
used for logistic regression.
# Create Train and Test Data for Naive Bayes
smoted.cars_train.nb <- smoted.cars_train
cars_test.nb <- cars_test[,-c(10,11)]
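The Naive Bayes fitting and prediction calls are not shown in the excerpt; with the e1071 package (an assumption, the report does not name the package) they would typically be:

```r
library(e1071)  # assumed package providing naiveBayes()
nb1 <- naiveBayes(Transport_car ~ ., data = smoted.cars_train.nb)
smoted.cars_train.nb$pred_nb1 <- predict(nb1, smoted.cars_train.nb)
cars_test.nb$pred_nb1 <- predict(nb1, cars_test.nb)
```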
# Confusion Matrix
confusionMatrix(smoted.cars_train.nb$pred_nb1,smoted.cars_train.nb$Transport_car)
Interpretation
By using Naive Bayes, we have predicted the employees (on test data) who will be travelling by car with
an accuracy of over 97% which seems very good.
d. Confusion Matrix Interpretation
In the above sections, we have created prediction models using Logistic regression, KNN and Naive Bayes model. Let us look at the confusion
matrices of the three models on the basis of test data and interpret them.
Logistic Regression - Model m2
# Confusion matrix for model m2 on test data
confusionMatrix(cars_test$class_m2, cars_test$Transport_car)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  112   1
##        Yes   2  23
##
##                Accuracy : 0.9783
##                  95% CI : (0.9378, 0.9955)
##     No Information Rate : 0.8261
##     P-Value [Acc > NIR] : 0.00000001576
##
##                   Kappa : 0.9256
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9825
##             Specificity : 0.9583
##          Pos Pred Value : 0.9912
##          Neg Pred Value : 0.9200
##              Prevalence : 0.8261
##          Detection Rate : 0.8116
##    Detection Prevalence : 0.8188
##       Balanced Accuracy : 0.9704
##        'Positive' Class : No
Interpretation: By using logistic regression, we have predicted the employees (on test data) who
will be travelling by car with an accuracy of close to 98%, which is very good.
KNN
# Confusion Matrix for KNN
confusionMatrix(cars_testlabels_pred, cars_testlabels)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  104   3
##        Yes  10  21
##
##                Accuracy : 0.9058
##                  95% CI : (0.8443, 0.9489)
##     No Information Rate : 0.8261
##     P-Value [Acc > NIR] : 0.006168
##
##                   Kappa : 0.706
##  Mcnemar's Test P-Value : 0.096092
##
##             Sensitivity : 0.9123
##             Specificity : 0.8750
##          Pos Pred Value : 0.9720
##          Neg Pred Value : 0.6774
##              Prevalence : 0.8261
##          Detection Rate : 0.7536
##    Detection Prevalence : 0.7754
##       Balanced Accuracy : 0.8936
##        'Positive' Class : No
Interpretation: By using KNN, we have predicted the employees (on test data) who will be travelling
by car with an accuracy of over 90%, which seems acceptable.
Naive Bayes
# Confusion Matrix for Naive Bayes
confusionMatrix(cars_test.nb$pred_nb1, cars_test.nb$Transport_car)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  113   3
##        Yes   1  21
##
##                Accuracy : 0.971
##                  95% CI : (0.9274, 0.992)
##     No Information Rate : 0.8261
##     P-Value [Acc > NIR] : 0.0000001165
##
##                   Kappa : 0.8957
##  Mcnemar's Test P-Value : 0.6171
##
##             Sensitivity : 0.9912
##             Specificity : 0.8750
##          Pos Pred Value : 0.9741
##          Neg Pred Value : 0.9545
##              Prevalence : 0.8261
##          Detection Rate : 0.8188
##    Detection Prevalence : 0.8406
##       Balanced Accuracy : 0.9331
##        'Positive' Class : No
Interpretation: By using Naive Bayes, we have predicted the employees (on test data) who will be
travelling by car with an accuracy of over 97%, which seems very good.
e. Remarks on model validation
In section (3.d), we have created the confusion matrix for the three models i.e. logistic regression, KNN
and Naive Bayes. Let us now compare the three models based on Accuracy, Sensitivity and Specificity.
1) Accuracy: In terms of Accuracy, the logistic regression model has worked the best, with an
accuracy of close to 98%. Naive Bayes also fares well with an accuracy of over 97%, and KNN fares
satisfactorily with an accuracy of over 90%.
2) Sensitivity: In terms of Sensitivity, the Naive Bayes model has worked the best, with a
sensitivity of over 99%. Logistic regression also fares well with a sensitivity of over 98%, and
KNN fares satisfactorily with a sensitivity of over 91%.
3) Specificity: In terms of Specificity, the logistic regression model has worked the best, with a
specificity of over 95%. Naive Bayes and KNN both achieve a specificity of 87.5%.
Overall Verdict: Based on the above scores, we can say that the Logistic prediction model has worked
the best for this dataset in comparison to others.
In fact, Naive Bayes and logistic regression match up almost equally in their predictions. One
must acknowledge the effect of SMOTE when creating the training dataset; had SMOTE not been used,
the model interpretation would have been quite different.
4) Prediction using Bagging and Boosting techniques
a. Bagging
i. Datasets for Bagging
# Training Data for bagging
smoted.cars_train.bagging <- smoted.cars_train
# Bagging model (bagging() comes from the ipred package; rpart supplies rpart.control)
library(ipred)
library(rpart)
cars.bagging <- bagging(Transport_car ~ ., data = smoted.cars_train.bagging,
                        control = rpart.control(maxdepth = 5, minsplit = 3), coob = TRUE)
cars.bagging
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = Transport_car ~ ., data =
smoted.cars_train.bagging,
## control = rpart.control(maxdepth = 5, minsplit = 3), coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.0287
# Confusion Matrix
confusionMatrix(smoted.cars_train.bagging$Transport_car, smoted.cars_train.bagging$pred.class)
# Confusion Matrix
confusionMatrix(cars_test.bagging$Transport_car,cars_test.bagging$pred.class)
The accuracy of the bagging model on the Training data is 97.1% which is also very good.
b. Boosting - using GBM
i. Datasets for Boosting
# Training Data for boosting
smoted.cars_train.boosting <- smoted.cars_train
smoted.cars_train.boosting$Transport_car <- ifelse(smoted.cars_train.boosting$Transport_car == "Yes", 1, 0)
# smoted.cars_train.boosting$Transport_car <- as.factor(smoted.cars_train.boosting$Transport_car)
#?gbm()
library(gbm)
cars.gbm <- gbm(
  formula = Transport_car ~ .,
  distribution = "bernoulli", # bernoulli because this is a logistic model and we want probabilities
  data = smoted.cars_train.boosting,
  n.trees = 10000,            # number of stumps
  interaction.depth = 1,      # number of splits to perform on a tree (starting from a single node)
  shrinkage = 0.001,          # shrinks the impact of each additional fitted base learner (tree)
  cv.folds = 5,               # cross-validation folds
  n.cores = NULL,             # will use all cores by default
  verbose = FALSE             # if TRUE, shows the error after every tree/stump and how it changes
)
# use type = "response" when predicting, just like in logistic regression, else we get log odds
smoted.cars_train.boosting$Transport_car <- as.factor(smoted.cars_train.boosting$Transport_car)
smoted.cars_train.boosting$pred.class <- as.factor(smoted.cars_train.boosting$pred.class)
confusionMatrix(smoted.cars_train.boosting$Transport_car, smoted.cars_train.boosting$pred.class)
# use type = "response" when predicting, just like in logistic regression, else we get log odds
confusionMatrix(cars_test.boosting$Transport_car, cars_test.boosting$pred.class)
The accuracy using boosting method on test data is 97.83% which is very good.
Comparing the Accuracy, Sensitivity and Specificity measures of the confusion matrices (on test
data) of all the models created so far, we can say that the boosting and logistic regression
algorithms have worked the best for this dataset.
However, one must consider the fact that we had balanced the target variable responder class using
SMOTE. The results would have been different if we had used the dataset as it is.
5) Actionable Insights and Recommendations
For this section, we will refer back to section 3.a, where we derived the logistic regression
model, identified the significant variables and examined the odds explanatory power. Considering
the significant variables, the logistic model equation is as follows:
Log (odds of Car Transport) = -39.59 + 1.15 (Age) – 1.65 (MBA) + 0.35 (Distance) + 1.48 (License)
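As an illustration (with made-up input values, not taken from the dataset), the equation converts to a probability via the logistic function:

```r
# Hypothetical employee: age 35, non-MBA (0), licensed (1), 15 km from office
log_odds <- -39.59 + 1.15 * 35 - 1.65 * 0 + 0.35 * 15 + 1.48 * 1
plogis(log_odds)  # probability of car transport, ~0.999
```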