
Machine Learning Group Assignment

Cars Transport Analysis


Submitted by:

 Suraj Ramkumar

 Rahul Godbole

 Harshvardhan Kadam

 Ankit Popat

 Ashish Srivastava

PGP BABI 2019-20 (Pune)

05/01/2020

TABLE OF CONTENTS

Introduction

1) Import the dataset and perform an exploratory analysis

a. Basic data summary, Univariate, Bivariate analysis, graphs, outliers, missing values

b. Check for Multicollinearity

c. Interpretation of Business problem and observations

2) Data Preparation

a. Split data into train and test

b. SMOTE for balancing responder class

3) Prediction Models

a. Logistic Regression

b. K-Nearest Neighbor (KNN) Model

c. Naive Bayes Model

d. Confusion Matrix Interpretation

e. Remarks on model validation

4) Prediction using Bagging and Boosting techniques

a. Bagging

b. Boosting - using GBM

c. Overall Best Model

5) Actionable Insights and Recommendations

Introduction
The dataset provided as part of this assignment consists of personal details of employees and their
preferred mode of transport. We need to analyze the data using machine learning models and predict
whether or not an employee will use a car as their mode of transport. We also need to determine
which variables are significant predictors of this decision.

1) Import the dataset and perform an exploratory analysis


a. Basic data summary, Univariate, Bivariate analysis, graphs, outliers, missing
values
Let us import the dataset from the working directory and understand the structure, get a basic summary
of the data and perform univariate and bivariate analysis.
# Set working directory
setwd("C:/Users/windows 7/Desktop/Great Lakes - PGP BABI/Machine Learning/Group
Assignment")

# import the dataset


cars <- read.csv("Cars.csv", header = TRUE)
dim(cars)

## [1] 444 9

There are 444 rows and 9 columns provided as part of the dataset.
i. Structure of Data
str(cars)

## 'data.frame': 444 obs. of 9 variables:


## $ Age : int 28 23 29 28 27 26 28 26 22 27 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
## $ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
## $ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
## $ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
## $ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
## $ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
## $ license : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...

The columns ‘Engineer’, ‘MBA’ and ‘license’ need to be converted to factor variables. This is done below:
cars$Engineer <- as.factor(as.character(cars$Engineer))
cars$Engineer <- factor(cars$Engineer, labels = c("Non-Engineer","Engineer"))
cars$MBA <- as.factor(as.character(cars$MBA))
cars$MBA <- factor(cars$MBA, labels = c("Non-MBA","MBA"))
cars$license <- as.factor(as.character(cars$license))
cars$license <- factor(cars$license, labels = c("No license","Licensed"))
str(cars)

## 'data.frame': 444 obs. of 9 variables:


## $ Age : int 28 23 29 28 27 26 28 26 22 27 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
## $ Engineer : Factor w/ 2 levels "Non-Engineer",..: 1 2 2 2 2 2 2 2 2 2 ...
## $ MBA : Factor w/ 2 levels "Non-MBA","MBA": 1 1 1 2 1 1 1 1 1 1 ...
## $ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
## $ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
## $ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
## $ license : Factor w/ 2 levels "No license","Licensed": 1 1 1 1 1 2 1 1 1 1 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...

ii. Summary of Data


summary(cars)

## Age Gender Engineer MBA


## Min. :18.00 Female:128 Non-Engineer:109 Non-MBA:331
## 1st Qu.:25.00 Male :316 Engineer :335 MBA :112
## Median :27.00 NA's : 1
## Mean :27.75
## 3rd Qu.:30.00
## Max. :43.00
## Work.Exp Salary Distance license
## Min. : 0.0 Min. : 6.50 Min. : 3.20 No license:340
## 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80 Licensed :104
## Median : 5.0 Median :13.60 Median :11.00
## Mean : 6.3 Mean :16.24 Mean :11.32
## 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43
## Max. :24.0 Max. :57.00 Max. :23.40
## Transport
## 2Wheeler : 83
## Car : 61
## Public Transport:300

iii. Missing Values


# Finding out number of missing values by each column
Missing_Values = sapply(cars, function(x)sum(is.na(x)))
data.frame(Missing_Values)

## Missing_Values
## Age 0
## Gender 0
## Engineer 0
## MBA 1
## Work.Exp 0
## Salary 0
## Distance 0
## license 0
## Transport 0

We observe one missing value, in the column ‘MBA’ (row 145). We can either impute it with the
majority class (i.e. Non-MBA) or use an algorithm such as KNN imputation.
Let us impute this missing value using the KNN imputation algorithm.
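(For reference, DMwR’s knnImputation defaults to the k = 10 nearest neighbours and a distance-weighted combination of their values; for a factor column such as MBA, the imputed level is derived from the neighbouring cases.)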

library(DMwR)
cars<-knnImputation(cars)
cars[145,]

## Age Gender Engineer MBA Work.Exp Salary Distance license


## 145 28 Female Non-Engineer Non-MBA 6 13.7 9.4 No license
## Transport
## 145 Public Transport

We have now imputed the missing value for this row (the imputed value is Non-MBA).
iv. The Initial Hypothesis
Null Hypothesis (H0): No predictor is able to predict the mode of transport.
Alternate Hypothesis (Ha): At least one of the predictors is able to predict the mode of transport.

v. Univariate Analysis - Boxplots and outlier analysis


Let us conduct this analysis for each column as follows:
1) Transport_car
The column ‘Transport’ has three categories: 2Wheeler, Public Transport and Car. Since we need to
predict car transport only, let us create a new Boolean column for transport by car: the value will be
1 for Car and 0 for either 2Wheeler or Public Transport.
cars$Transport_car <- ifelse(cars$Transport == "Car",1,0)
cars$Transport_car <- as.factor(as.character(cars$Transport_car))
cars$Transport_car <- factor(cars$Transport_car, labels = c("No","Yes"))
cars <- cars[,-9]
table(cars$Transport_car)

##
## No Yes
## 383 61

plot(cars$Transport_car, col="cyan", main="Barplot of car transport")

res.rate <- round(sum(cars$Transport_car=="Yes")/nrow(cars)*100,2)

The response rate for the target variable ‘Transport_Car’ is 13.74%.


2) Age
summary(cars$Age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 18.00 25.00 27.00 27.75 30.00 43.00

hist(cars$Age, col = "cyan", main = "Histogram of Age")


OutVals_age = boxplot(cars$Age, col = "cyan", main = "Boxplot of Age")$out

There are 25 outliers for the variable ‘Age’. These outliers are:
cars$Age[which(cars$Age %in% OutVals_age)]

## [1] 39 39 39 38 40 38 38 38 38 40 40 39 40 38 39 38 40 39 38 42 40 43 40
## [24] 38 39

Among all the values observed as outliers for ‘Age’, none of the values seem to be a typo or impossible
values. Hence, we will not treat these outliers.
3) Gender
summary(cars$Gender)

## Female Male
## 128 316

plot(cars$Gender, col="cyan", main = "Barplot of Gender")

gender_ratio <- round(sum(cars$Gender=="Male")/sum(cars$Gender=="Female"),2)

The male-to-female ratio is 2.47:1.


4) Engineer
table(cars$Engineer)

##
## Non-Engineer Engineer
## 109 335

plot(cars$Engineer, col="cyan", main = "Barplot of Engineers v/s Non Engineers")

eng_ratio <- round(sum(cars$Engineer=="Engineer")/sum(cars$Engineer=="Non-Engineer"),2)

The Engineer-to-non-Engineer ratio is 3.07:1.


5) MBA
table(cars$MBA)

##
## Non-MBA MBA
## 332 112

plot(cars$MBA, col="cyan", main = "Barplot of MBAs v/s Non MBAs")

mba_ratio <- round(sum(cars$MBA=="MBA")/sum(cars$MBA=="Non-MBA"),2)

The MBA-to-non-MBA ratio is 0.34:1.


6) Work Experience
summary(cars$Work.Exp)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.0 3.0 5.0 6.3 8.0 24.0

hist(cars$Work.Exp, col = "cyan", main = "Histogram of Work Exp")

OutVals_work.ex = boxplot(cars$Work.Exp, col = "cyan", main = "Boxplot of Work Exp")$out

There are 38 outliers identified for the variable ‘Work Exp’. These outliers are:
cars$Work.Exp[which(cars$Work.Exp %in% OutVals_work.ex)]

## [1] 19 16 21 17 16 18 19 18 21 16 19 19 18 19 20 22 16 20 18 21 20 20 16
## [24] 17 21 18 20 21 19 22 22 19 24 20 19 19 19 21

Among all the values observed as outliers for ‘Work Exp’, none of the values seem to be a typo or
impossible values. Hence, we will not treat these outliers.

7) Salary
summary(cars$Salary)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 6.50 9.80 13.60 16.24 15.72 57.00

OutVals_sal <- boxplot(cars$Salary, col = "cyan", main = "Boxplot of Salary")$out

hist(cars$Salary, col = "cyan", main = "Histogram of Salary")

There are 59 outliers identified for the variable ‘Salary’. These are:
cars$Salary[which(cars$Salary %in% OutVals_sal)]

## [1] 36.6 38.9 25.9 34.8 28.8 39.9 39.0 28.7 36.9 28.7 34.9 47.0 28.8 36.9
## [15] 54.0 29.9 34.9 36.0 44.0 37.0 24.9 43.0 37.0 54.0 44.0 34.0 48.0 42.0
## [29] 51.0 45.0 34.0 28.8 45.0 42.9 41.0 40.9 30.9 41.9 43.0 33.0 36.0 33.0
## [43] 38.0 46.0 45.0 48.0 35.0 51.0 51.0 55.0 45.0 42.0 52.0 38.0 57.0 44.0
## [57] 45.0 47.0 50.0

Among all the values observed as outliers for ‘Salary’, none of the values seem to be a typo or impossible
values. Hence, we will not treat these outliers.
8) Distance
summary(cars$Distance)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 3.20 8.80 11.00 11.32 13.43 23.40

hist(cars$Distance, main = "Histogram of Distance", col = "cyan")

OutVal_dist <- boxplot(cars$Distance, col = "cyan", main = "Boxplot of Distance")$out

There are 9 outliers identified for the variable ‘Distance’. These are:
cars$Distance[which(cars$Distance %in% OutVal_dist)]

## [1] 20.7 20.8 21.0 21.3 21.4 21.5 21.5 22.8 23.4

Among all the values observed as outliers for ‘Distance’, none of the values seem to be a typo or
impossible values. Hence, we will not treat these outliers.
9) License
table(cars$license)

##
## No license Licensed
## 340 104

plot(cars$license, col = "cyan", main = "Barplot of Licence and No License")

license_ratio <- round(sum(cars$license=="Licensed")/sum(cars$license=="No license"),2)

The licensed-to-unlicensed ratio is 0.31:1.

vi. Bivariate Analysis
Let us now check for bivariate analysis i.e. ‘Transport_car’ with respect to other variables.
1) Age and Transport_Car
boxplot(cars$Age ~ cars$Transport_car, col = "cyan",
        main = "Boxplot of Age and Transport_car",
        xlab = "Transport_car", ylab = "Age")

Here, we observe that the average age of employees travelling by car is more than those who do not
travel by car.
2) Gender and Transport_Car
# Crosstab
gender.car<-table(cars$Gender,cars$Transport_car)
gender.car.totals<-addmargins(gender.car)
gender.car.totals

##
## No Yes Sum
## Female 115 13 128
## Male 268 48 316
## Sum 383 61 444

# sided barplot
barplot(gender.car, beside = TRUE, xlab = "Car Transport", ylab = "Count",
        legend = rownames(gender.car), args.legend = list(x = "topright"),
        col = c("pink","cyan"), main = "Car Transport by Gender")

Observations:
1) Car transport within gender: within each gender, the proportion of employees travelling by car is
significantly smaller than the proportion who do not.

2) Car transport by gender: more males travel by car than females.

3) Engineer and Transport_Car


# Crosstab
eng.car<-table(cars$Engineer,cars$Transport_car)
eng.car.totals<-addmargins(eng.car)
eng.car.totals

##
## No Yes Sum
## Non-Engineer 100 9 109
## Engineer 283 52 335
## Sum 383 61 444

# sided barplot
barplot(eng.car, beside = TRUE, xlab = "Car Transport", ylab = "Count",
        legend = rownames(eng.car), args.legend = list(x = "topright"),
        col = c("light green","cyan"), main = "Car Transport by Engineer")

Observations:
1) Car transport within Engineers: the majority of Engineers do not travel by car.

2) Car transport by Engineers v/s non-Engineers: more Engineers travel by car than non-Engineers.

4) MBA and Transport_Car


# Crosstab
mba.car<-table(cars$MBA,cars$Transport_car)
mba.car.totals<-addmargins(mba.car)
mba.car.totals

##
## No Yes Sum
## Non-MBA 283 49 332
## MBA 100 12 112
## Sum 383 61 444

# sided barplot
barplot(mba.car, beside = TRUE, xlab = "Car Transport", ylab = "Count",
        legend = rownames(mba.car), args.legend = list(x = "topright"),
        col = c("light green","cyan"), main = "Car Transport by MBAs")

Observations:
1) Car transport within MBAs: the majority of MBAs do not travel by car.

2) Car transport by MBAs v/s non-MBAs: more non-MBAs travel by car than MBAs.

5) Work Experience and Transport_Car


boxplot(cars$Work.Exp ~ cars$Transport_car, col = "cyan",
        main = "Boxplot of Work Exp and Transport_car",
        xlab = "Transport_car", ylab = "Years of Work Exp")

Here, we observe that on an average, employees with more years of work experience tend to travel by car
as compared to those with less years of work experience.

6) Salary and Transport_Car
boxplot(cars$Salary ~ cars$Transport_car, col = "cyan",
        main = "Boxplot of Salary and Transport_car",
        xlab = "Transport_car", ylab = "Salary")

Here, we observe that on an average, employees with higher salary tend to travel by car as compared to
those with lesser salary. However, few with moderately higher salaries do not travel by car (captured as
outliers in Transport_car ‘No’).
7) Distance and Transport_Car
boxplot(cars$Distance ~ cars$Transport_car, col = "cyan",
        main = "Boxplot of Distance and Transport_car",
        xlab = "Transport_car", ylab = "Distance")

Here, we observe that, on average, employees staying closer to the office (shorter distance) prefer
not to travel by car compared to those staying farther away (greater distance from office).
8) License and Transport_Car
# Crosstab
license.car<-table(cars$license,cars$Transport_car)
license.car.totals<-addmargins(license.car)
license.car.totals

##
## No Yes Sum
## No license 327 13 340
## Licensed 56 48 104
## Sum 383 61 444

# sided barplot
barplot(license.car, beside = TRUE, xlab = "Car Transport", ylab = "Count",
        legend = rownames(license.car), args.legend = list(x = "topright"),
        col = c("light green","cyan"), main = "Car Transport and License")

Observations:
Almost equal numbers of licensed employees travel and do not travel by car.
Note: In the above figure, some employees without a license appear to travel by car. This does not
mean unlicensed employees drive cars: these may be employees who hold a two-wheeler license rather
than a car license (recall that we created a separate column for car transport), or they may be
accompanying passengers in the car.

b. Check for Multicollinearity
Let us create a correlation plot for the given dataset and understand which variables are highly
correlated.
library(corrplot)
# Convert all columns (including the target Transport_car, which the
# observations below refer to) to numeric for the correlation plot
cars.num <- sapply(cars, as.numeric)
cars.cor <- cor(cars.num)
corrplot.mixed(cars.cor, main = "Correlation plot")

The observations from the above correlation plot:


1) The variables Age, Work Exp, Salary, Transport_car have high correlation. This would definitely
lead to multi-collinearity. We will evaluate this later as well using VIF.

2) The variables Distance and license have comparatively weak correlations with the other variables.

3) The variables Gender, Engineer and MBA show virtually no correlation with the other variables.

c. Interpretation of Business problem and observations
In the given scenario, let us assume we are an automobile company looking for strategies to improve
sales, and that the objective of this exercise is to target employees of an organization to promote
automobile sales (assuming the employer has obtained prior approval from employees to share their
personal details with the automobile company).
Business Problem: To increase automobile sales (while optimizing marketing/sales costs) by
targeting employees of organizations based on their personal, educational and work-related
characteristics.
The key observations (summary) based on exploratory analysis are as follows:

1) Car Transport (dependent variable): Out of the sample observations available, the response rate
(percentage of employees who commute to work by car) is 13.74%. The responder class distribution
is unequal.

2) Gender, Engineer, MBA, License: In the given sample, the proportion of males is higher than
females, engineers higher than non-engineers, non-MBAs higher than MBAs, and unlicensed employees
higher than licensed.

3) Age, Work Exp, Salary, Distance to work: The average age, work experience, salary and distance
to work are 27.75 yrs, 6.3 yrs, 16.24 and 11.32 km respectively.

4) Age and Car Transport: The average age of employees travelling by car is higher than of those
who do not.

5) Gender and Car Transport: Within each gender, only a small proportion travels by car; more males
travel by car than females.

6) Engineers and Car Transport: The majority of Engineers do not travel by car; more Engineers
travel by car than non-Engineers.

7) MBAs and Car Transport: The majority of MBAs do not travel by car; more non-MBAs travel by
car than MBAs.

8) Work Exp and Car Transport: Employees with more years of work experience tend to travel by car.

9) Salary and Car Transport: Employees with higher salaries tend to travel by car; however, a few
with moderately high salaries do not (captured as outliers in Transport_car ‘No’).

10) Distance and Car Transport: Employees staying closer to the office prefer not to travel by car
compared to those staying farther away.

11) License and Car Transport: Almost equal numbers of licensed employees travel and do not travel
by car.

12) Correlation: Age, Work Exp, Salary and Transport_car are highly correlated; Distance and
license have weaker correlations; Gender, Engineer and MBA show virtually no correlation with the
other variables.

In the subsequent sections, we will create a predictive model based on logistic regression and machine
learning models to understand the odds of employees travelling by car based on different variables.

2) Data Preparation
In section 1, we performed a few aspects of data preparation, i.e.
1) Conversion of data types to relevant formats (e.g. numeric to factor)

2) Treatment of missing values using KNN imputation method

a. Split data into train and test


In this step we will split the data into train and test data in an approximately 70:30 ratio.
set.seed(1000)
pd<-sample(2,nrow(cars),replace=TRUE, prob=c(0.7,0.3))

cars_train<-cars[pd==1,]
cars_test<-cars[pd==2,]

nrow(cars_train)/nrow(cars)

## [1] 0.6891892

nrow(cars_test)/nrow(cars)

## [1] 0.3108108

We will also check the target variable response rate in both the train and test data.
rr.train<-sum(cars_train$Transport_car == "Yes")/nrow(cars_train)
rr.test<-sum(cars_test$Transport_car == "Yes")/nrow(cars_test)

Response rate for Training data: 12.09%


Response rate for Test data: 17.39%
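The train and test response rates differ noticeably because sample() does not stratify by the target variable. As an aside, a stratified split would keep the proportions identical in both partitions; a minimal sketch, assuming the caret package (which we load later for confusionMatrix() anyway):

# Optional sketch: stratified 70:30 split preserving the responder proportion
library(caret)
set.seed(1000)
idx <- createDataPartition(cars$Transport_car, p = 0.7, list = FALSE)
strat_train <- cars[idx, ]
strat_test  <- cars[-idx, ]

We continue with the original split to stay consistent with the results reported below.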
We observe that the response rate of the target variable (responder class) is very low. We need to
balance the responder class proportion of the target column in the training data. Let us apply the
SMOTE technique for this.

b. SMOTE for balancing responder class


library(DMwR)
table(cars_train$Transport_car)

##
## No Yes
## 269 37

smoted.cars_train <- SMOTE(Transport_car ~ ., data = cars_train,
                           perc.over = 350, perc.under = 150)
nrow(smoted.cars_train)

## [1] 314

table(smoted.cars_train$Transport_car)

##
## No Yes
## 166 148

prop.table(table(smoted.cars_train$Transport_car))

##
## No Yes
## 0.5286624 0.4713376

After applying SMOTE, we have now balanced the target variable responder class in the training data.
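The counts above follow from DMwR's SMOTE arithmetic (as we understand the implementation): perc.over = 350 creates floor(350/100) = 3 synthetic cases per minority case, and perc.under = 150 then samples floor(150/100 x #synthetic) majority cases. A quick sanity check:

# Sanity check of the SMOTE counts (assumes DMwR's integer arithmetic)
n_min <- 37                      # original "Yes" cases in cars_train
n_syn <- (350 %/% 100) * n_min   # 111 synthetic "Yes" cases
c(No = (150 * n_syn) %/% 100,    # 166 majority cases retained
  Yes = n_min + n_syn)           # 148 minority cases in total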

3) Prediction Models
a. Logistic Regression
Let us now apply logistic regression to the training dataset considering all the variables.
i. Train Model 1
#?glm()
m1<-glm(Transport_car ~.,data=smoted.cars_train, family="binomial")
summary(m1)

##
## Call:
## glm(formula = Transport_car ~ ., family = "binomial", data = smoted.cars_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.82195 -0.01496 -0.00003 0.02435 1.56132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -88.62249 18.30654 -4.841 1.29e-06 ***
## Age 2.93564 0.61732 4.755 1.98e-06 ***
## GenderMale -0.15685 0.77881 -0.201 0.84039
## EngineerEngineer 0.07116 1.14323 0.062 0.95037
## MBAMBA -1.51164 0.80189 -1.885 0.05942 .
## Work.Exp -1.54290 0.38321 -4.026 5.67e-05 ***
## Salary 0.25046 0.07626 3.284 0.00102 **
## Distance 0.48587 0.17777 2.733 0.00627 **
## licenseLicensed 1.04909 0.72986 1.437 0.15061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.264 on 313 degrees of freedom
## Residual deviance: 61.917 on 305 degrees of freedom
## AIC: 79.917
##
## Number of Fisher Scoring iterations: 9

Interpretation:
The statistically significant variables according this model are:
1) Age
2) Work Exp
3) Salary
4) Distance
The AIC value of this model is 79.917.
The null deviance (434.26) is much larger than the residual deviance (61.92), implying that the
predictors add real explanatory power over the intercept-only model.
VIF: Model 1
library(car)
vif(m1)

## Age Gender Engineer MBA Work.Exp Salary Distance


## 8.872944 1.146314 1.080914 1.209013 12.239783 3.080634 1.471445
## license
## 1.129472

The VIF values of Age, Work.Exp and Salary are high, implying the presence of multicollinearity in the
model. Let us remove Work.Exp, which has the highest VIF, and run the model again.
ii. Train Model 2
# Remove Work Exp variable
smoted.cars_train2 <- smoted.cars_train[,-5]

# Logistic Model
m2<-glm(Transport_car ~.,data=smoted.cars_train2, family="binomial")
summary(m2)

##
## Call:
## glm(formula = Transport_car ~ ., family = "binomial", data = smoted.cars_train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93780 -0.08114 -0.00327 0.04801 2.42137
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.10441 6.77944 -5.768 8.02e-09 ***
## Age 1.12086 0.21381 5.242 1.58e-07 ***
## GenderMale -0.98921 0.65216 -1.517 0.12931
## EngineerEngineer 0.25594 0.89157 0.287 0.77406
## MBAMBA -1.61003 0.68234 -2.360 0.01830 *
## Salary 0.01248 0.04275 0.292 0.77036
## Distance 0.34922 0.11664 2.994 0.00275 **
## licenseLicensed 1.51176 0.62548 2.417 0.01565 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.264 on 313 degrees of freedom
## Residual deviance: 87.298 on 306 degrees of freedom
## AIC: 103.3
##
## Number of Fisher Scoring iterations: 8

The statistically significant variables for this model are:

1) Age
2) MBA
3) Distance
4) License
The AIC value is 103.3, and the null deviance is greater than the residual deviance, implying that a meaningful model exists.
VIF: Model 2
vif(m2)

## Age Gender Engineer MBA Salary Distance license


## 1.519176 1.190020 1.090826 1.266171 1.280798 1.128442 1.195993

The VIF for all variables is acceptable. Let us proceed with this model for further validations.
iii. Log Regression Model 2: Model significance verification
1) Log Likelihood ratio test
library(lmtest)
lrtest(m2)

## Likelihood ratio test


##
## Model 1: Transport_car ~ Age + Gender + Engineer + MBA + Salary + Distance +
## license
## Model 2: Transport_car ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -43.649
## 2 1 -217.132 -7 346.97 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation
H0: all betas are zero.
H1: at least one beta is non-zero.
The log-likelihood of the intercept-only model is -217.132, while that of the fitted model (m2) is
-43.649. Hence 1 - (-43.649 / -217.132) = 79.89% of the uncertainty inherent in the intercept-only
model is explained by model m2.
The likelihood-ratio chi-square statistic is significant, and the p-value lets us reject the null
hypothesis that all betas are zero. The model is therefore significant.
2) McFadden’s pseudo R Square test
library(pscl)
Pseudo_m2<-pR2(m2)
Pseudo_m2

## llh llhNull G2 McFadden r2ML


## -43.6487809 -217.1320082 346.9664545 0.7989758 0.6687854

## r2CU
## 0.8926938

Interpretation: Based on McFadden’s R², we conclude that 79.9% of the uncertainty of the
intercept-only model is explained by the fitted model (m2). The goodness of fit is therefore robust.
iv. Optimize LM using step() function
In this step, we will optimize model m2 using stepwise selection, retaining the set of variables that
yields the lowest AIC.
m2 <- step(glm(Transport_car~.,data = smoted.cars_train2, family = "binomial"))

## Start: AIC=103.3
## Transport_car ~ Age + Gender + Engineer + MBA + Salary + Distance +
## license
##
## Df Deviance AIC
## - Engineer 1 87.380 101.38
## - Salary 1 87.384 101.38
## <none> 87.298 103.30
## - Gender 1 89.719 103.72
## - MBA 1 93.485 107.48
## - license 1 93.579 107.58
## - Distance 1 98.061 112.06
## - Age 1 154.710 168.71
##
## Step: AIC=101.38
## Transport_car ~ Age + Gender + MBA + Salary + Distance + license
##
## Df Deviance AIC
## - Salary 1 87.488 99.488
## <none> 87.380 101.380
## - Gender 1 89.763 101.763
## - license 1 93.626 105.626
## - MBA 1 93.946 105.946
## - Distance 1 98.065 110.065
## - Age 1 155.363 167.363
##
## Step: AIC=99.49
## Transport_car ~ Age + Gender + MBA + Distance + license
##
## Df Deviance AIC
## <none> 87.488 99.488
## - Gender 1 89.781 99.781
## - license 1 93.864 103.864
## - MBA 1 94.050 104.050
## - Distance 1 98.946 108.946
## - Age 1 252.255 262.255

summary(m2)

##
## Call:
## glm(formula = Transport_car ~ Age + Gender + MBA + Distance +
## license, family = "binomial", data = smoted.cars_train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.92044 -0.07972 -0.00299 0.04800 2.43207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.5960 6.4534 -6.136 0.000000000848 ***
## Age 1.1496 0.1992 5.770 0.000000007907 ***
## GenderMale -0.9488 0.6406 -1.481 0.13861
## MBAMBA -1.6461 0.6786 -2.426 0.01527 *
## Distance 0.3539 0.1151 3.075 0.00211 **
## licenseLicensed 1.4852 0.6105 2.433 0.01498 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.264 on 313 degrees of freedom
## Residual deviance: 87.488 on 308 degrees of freedom
## AIC: 99.488
##
## Number of Fisher Scoring iterations: 8

pR2(m2)

## llh llhNull G2 McFadden r2ML


## -43.7438202 -217.1320082 346.7763761 0.7985381 0.6685848
## r2CU
## 0.8924261

Interpretation: Based on McFadden’s R² for the optimized model, 79.85% of the uncertainty of the
intercept-only model is explained by the fitted model. The goodness of fit remains robust.
v. Odds Explanatory Power
Let us examine the odds ratios, and the probabilities they imply, for the variables affecting an
employee’s decision to commute by car.
round(exp(coef(m2)),4) # Odds Ratio

## (Intercept) Age GenderMale


## 0.0000 2.8166 0.2055
## MBAMBA Distance licenseLicensed
## 0.1446 1.4097 4.0904

round(exp(coef(m2))/(1+exp(coef(m2))),4) # Probability

## (Intercept) Age GenderMale


## 0.0000 0.7380 0.1705
## MBAMBA Distance licenseLicensed
## 0.1263 0.5850 0.8036
Interpretation
For a one-unit increase in a numeric variable (here Age and Distance), the odds of an employee
travelling by car (vs. not travelling by car) are multiplied by the factor shown in the table below;
the probability column is the transformation odds/(1 + odds).

# Odds ratios, implied probabilities and relative variable importance
odds_m2 <- exp(coef(m2))
probability <- odds_m2/(1 + odds_m2)
# m2 has only 6 coefficients after step(), so sum the odds over all predictors
# (excluding the intercept) rather than hard-coding indices up to 8
sum_odds_m2 <- sum(odds_m2[-1])
varImp <- odds_m2/sum_odds_m2*100
odds_mat <- data.frame(odds_m2, probability, varImp)
odds_mat$probability <- odds_mat$probability*100
options(scipen = 999)
round(odds_mat, 2)

## odds_m2 probability varImp


## (Intercept) 0.00 0.00 0.00
## Age 2.82 73.80 32.50
## GenderMale 0.21 17.05 2.37
## MBAMBA 0.14 12.63 1.67
## Distance 1.41 58.50 16.27
## licenseLicensed 4.09 80.36 47.20

For the categorical variables (Gender, MBA, license):

1) When the employee is male, the odds of travelling by car are 0.21 times those of a female
employee.

2) When the employee is an MBA, the odds of opting for car transport are 0.14 times those of a
non-MBA.

3) When the employee holds a driving license, the odds of opting for car transport are 4.09 times
those of an employee without one.

A quick sanity check with the fitted model follows below.
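To make these odds tangible, we can score a hypothetical employee profile with the fitted model. The profile values below are illustrative assumptions, not cases taken from the data:

# Hypothetical profile: 32-year-old licensed male, non-MBA, commuting 14 km
new_emp <- data.frame(Age = 32, Gender = "Male", MBA = "Non-MBA",
                      Distance = 14, license = "Licensed")
predict(m2, newdata = new_emp, type = "response")   # predicted P(car transport)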

vi. Prediction using model m2


1) Prediction on Training Data
smoted.cars_train2$prob_m2<-predict(m2,type = "response")
# floor(p + 0.5) rounds to the nearest integer, i.e. classifies at a 0.5 cut-off
smoted.cars_train2$class_m2<-floor(smoted.cars_train2$prob_m2+0.5)

# convert to factor
smoted.cars_train2$class_m2<-factor(smoted.cars_train2$class_m2, labels = c("No","Yes"))

# Confusion matrix for model m2 on training data


library(caret)
confusionMatrix(smoted.cars_train2$class_m2,smoted.cars_train2$Transport_car)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 159 9
## Yes 7 139
##
## Accuracy : 0.949
## 95% CI : (0.9186, 0.9706)
## No Information Rate : 0.5287
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.8977
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.9578
## Specificity : 0.9392
## Pos Pred Value : 0.9464
## Neg Pred Value : 0.9521
## Prevalence : 0.5287
## Detection Rate : 0.5064
## Detection Prevalence : 0.5350
## Balanced Accuracy : 0.9485
##
## 'Positive' Class : No
##

2) Prediction on Test Data


cars_test$prob_m2<-predict(m2,newdata = cars_test,type = "response")
cars_test$class_m2<-floor(cars_test$prob_m2+0.5)

# convert to factor
cars_test$class_m2<-factor(cars_test$class_m2, labels = c("No","Yes"))

# Confusion matrix for model m2 on test data


confusionMatrix(cars_test$class_m2,cars_test$Transport_car)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 112 1
## Yes 2 23
##
## Accuracy : 0.9783
## 95% CI : (0.9378, 0.9955)
## No Information Rate : 0.8261
## P-Value [Acc > NIR] : 0.00000001576
##
## Kappa : 0.9256
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9825
## Specificity : 0.9583
## Pos Pred Value : 0.9912
## Neg Pred Value : 0.9200
## Prevalence : 0.8261
## Detection Rate : 0.8116
## Detection Prevalence : 0.8188
## Balanced Accuracy : 0.9704
##
## 'Positive' Class : No

Accuracy of close to 98% on test data seems very good.


vii. Model Performance – ROC plot, KS, AUC, GINI
1) ROC plot for model m2
# ROC plot for Model m2

library(Deducer)
rocplot(m2)

2) KS, AUC, Gini for model m2


On Training Data
## Model 2
# KS, Gini and AUC for model m2
library(ROCR)
library(ineq)

#AUC for m2
pred_m2_train <- prediction(smoted.cars_train2$prob_m2,smoted.cars_train2$Transport_car)
perf_m2_train <- performance(pred_m2_train, "tpr", "fpr")

auc_m2_train <- performance(pred_m2_train,"auc")


auc_m2_train <- round(as.numeric(auc_m2_train@y.values),4)*100
auc_m2_train

## [1] 98.81

# KS for m2
KS_m2_train <- round(max(attr(perf_m2_train, 'y.values')[[1]] -
                         attr(perf_m2_train, 'x.values')[[1]]), 4)*100
KS_m2_train

## [1] 91.05

# Gini for m2
gini_m2_train = round(ineq(smoted.cars_train2$prob_m2, type="Gini"),4)*100
gini_m2_train

## [1] 51.61

On Test Data
## Model 2
# KS, Gini and AUC for model m2
#library(ROCR)
#library(ineq)

#AUC for m2
pred_m2_test <- prediction(cars_test$prob_m2,cars_test$Transport_car)
perf_m2_test <- performance(pred_m2_test, "tpr", "fpr")

auc_m2_test <- performance(pred_m2_test,"auc")


auc_m2_test <- round(as.numeric(auc_m2_test@y.values),4)*100
auc_m2_test

## [1] 99.52

# KS for m2
KS_m2_test <- round(max(attr(perf_m2_test, 'y.values')[[1]] -
                        attr(perf_m2_test, 'x.values')[[1]]), 4)*100
KS_m2_test

## [1] 94.96

# Gini for m2
gini_m2_test = round(ineq(cars_test$prob_m2, type="Gini"),4)*100
gini_m2_test

## [1] 79.4
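Note that ineq() as used above computes the Gini coefficient of the predicted probabilities themselves; the more common classifier Gini is derived from the AUC. A quick alternative computation:

# Conventional classifier Gini: Gini = 2*AUC - 1
gini_auc_train <- 2*(auc_m2_train/100) - 1   # ~0.976 on training data
gini_auc_test  <- 2*(auc_m2_test/100) - 1    # ~0.990 on test data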

The model validation metrics (KS, AUC, Gini) obtained imply that this is a robust logistic regression
model.

b. K-Nearest Neighbor (KNN) Model
Let us create the train and test data for the KNN model. We will use the same train and test split
as used for logistic regression.
# Create Train and Test Data for KNN
smoted.cars_train.knn <- smoted.cars_train
cars_test.knn <- cars_test[,-c(10,11)]

i. Convert all factor variables to numeric


# Training data
smoted.cars_train.knn$Gender <- as.numeric(smoted.cars_train.knn$Gender)
smoted.cars_train.knn$Engineer <- as.numeric(smoted.cars_train.knn$Engineer)
smoted.cars_train.knn$MBA <- as.numeric(smoted.cars_train.knn$MBA)
smoted.cars_train.knn$license <- as.numeric(smoted.cars_train.knn$license)
smoted.cars_train.knn$Transport_car <- as.numeric(smoted.cars_train.knn$Transport_car)
str(smoted.cars_train.knn)

## 'data.frame': 314 obs. of 9 variables:


## $ Age : num 28 28 24 29 28 25 25 26 24 30 ...
## $ Gender : num 2 2 2 2 2 1 2 2 2 1 ...
## $ Engineer : num 2 2 1 1 1 2 2 2 2 2 ...
## $ MBA : num 2 2 1 1 1 1 1 1 1 1 ...
## $ Work.Exp : num 5 7 2 5 5 1 1 4 4 8 ...
## $ Salary : num 14.8 13.6 8.5 14.8 14.9 8.9 8.6 13 8.5 14.6 ...
## $ Distance : num 10.8 6.3 6.2 15.4 12.5 16.8 9.7 19.1 7.5 6.5 ...
## $ license : num 2 1 1 1 2 1 1 2 1 1 ...
## $ Transport_car: num 1 1 1 1 1 1 1 1 1 1 ...

# Test data
cars_test.knn$Gender <- as.numeric(cars_test.knn$Gender)
cars_test.knn$Engineer <- as.numeric(cars_test.knn$Engineer)
cars_test.knn$MBA <- as.numeric(cars_test.knn$MBA)
cars_test.knn$license <- as.numeric(cars_test.knn$license)
cars_test.knn$Transport_car <- as.numeric(cars_test.knn$Transport_car)
str(cars_test.knn)

## 'data.frame': 138 obs. of 9 variables:


## $ Age : int 23 28 27 27 32 24 25 27 27 25 ...
## $ Gender : num 1 2 2 2 2 2 2 1 2 1 ...
## $ Engineer : num 2 2 2 2 2 2 1 2 2 1 ...
## $ MBA : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Work.Exp : int 4 5 4 4 9 6 1 5 6 3 ...
## $ Salary : num 8.3 14.4 13.5 13.4 15.5 10.6 7.6 12.5 12.6 9.6 ...
## $ Distance : num 3.3 5.1 5.3 5.5 5.5 6.1 6.3 6.4 6.5 6.7 ...
## $ license : num 1 1 2 2 1 1 1 1 1 1 ...
## $ Transport_car: num 1 1 1 1 1 1 1 1 1 1 ...

ii. Normalize the Dataset


Let us normalize the numeric variables of the dataset using a function, as follows:
# create normalization function
# The below technique is called as min-max normalization
normalize <- function(x){
z<-((x-min(x))/(max(x)-min(x)))
return(z)
}
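For instance, min-max normalization maps any numeric vector onto [0, 1]:

# Quick illustration of the function above
normalize(c(10, 20, 40))   # returns 0.0000000 0.3333333 1.0000000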

# As the data frame is a list of equal-length vectors, we can use lapply()
# to apply normalize() to each feature in the data frame

listdata <- lapply(smoted.cars_train.knn, normalize)


listdata.test <- lapply(cars_test.knn, normalize)

# Convert the output of the lapply to a data frame


smoted.cars_train.knn.norm <- as.data.frame(listdata)
cars_test.knn.norm <- as.data.frame(listdata.test)

iii. Separate out the Target labels


cars_trainlabels <- smoted.cars_train.knn.norm[,9]
cars_testlabels <- cars_test.knn.norm[,9]

# Remove the Target labels in Train and test

smoted.cars_train.knn.norm <- smoted.cars_train.knn.norm[,-9]


cars_test.knn.norm <- cars_test.knn.norm[,-9]

iv. Predict test data using KNN model


Let us now train the model using KNN. Let us assume initial k = 3.
library(class)

cars_testlabels_pred <- knn(train = smoted.cars_train.knn.norm,
                            test = cars_test.knn.norm,
                            cl = cars_trainlabels, k = 3)
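The choice k = 3 is an assumption; a small scan over odd values of k (a sketch using the objects created above) can confirm it is reasonable:

# Optional sketch: test-set accuracy for a range of odd k values
for (k in seq(1, 15, by = 2)) {
  pred_k <- knn(train = smoted.cars_train.knn.norm, test = cars_test.knn.norm,
                cl = cars_trainlabels, k = k)
  acc <- mean(as.character(pred_k) == as.character(cars_testlabels))
  cat("k =", k, " test accuracy =", round(acc, 4), "\n")
}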

v. Evaluate the model on test data


# Convert both vectors to factors with matching labels so that
# confusionMatrix() can compare them
cars_testlabels <- factor(cars_testlabels, labels = c("No","Yes"))
cars_testlabels_pred <- factor(cars_testlabels_pred, labels = c("No","Yes"))

# Confusion Matrix for KNN


confusionMatrix(cars_testlabels_pred,cars_testlabels)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 104 3
## Yes 10 21
##
## Accuracy : 0.9058
## 95% CI : (0.8443, 0.9489)
## No Information Rate : 0.8261
## P-Value [Acc > NIR] : 0.006168
##
## Kappa : 0.706
##
## Mcnemar's Test P-Value : 0.096092
##
## Sensitivity : 0.9123
## Specificity : 0.8750
## Pos Pred Value : 0.9720
## Neg Pred Value : 0.6774
## Prevalence : 0.8261
## Detection Rate : 0.7536
## Detection Prevalence : 0.7754
## Balanced Accuracy : 0.8936
##
## 'Positive' Class : No
##

Interpretation
Using KNN, we have predicted the employees (on test data) who will travel by car with an accuracy of
over 90%, which is acceptable.

c. Naive Bayes Model
i. Applicability of Naive Bayes on this dataset
Our business question is binary: will an employee commute by car (Yes/No)? In its original form, the
target variable (Transport) had three classes, i.e. ‘2Wheeler’, ‘Car’ and ‘Public Transport’, so
applying Naive Bayes directly to that column would not answer the binary question.
Instead, the column can be transformed. Since we need to predict car transport only, we created a new
Boolean column for transport by car: 1 for Car and 0 for either 2Wheeler or Public Transport. Naive
Bayes can then be applied to this binary target.
This transformation was done in section (1.a) of the assignment. We will now proceed with the
remaining steps of building and evaluating the Naive Bayes model.
ii. Data for Naive Bayes
Let us create the train and test data for the Naive Bayes model. We will use the same train and test
split as used for logistic regression.
# Create Train and Test Data for Naive Bayes
smoted.cars_train.nb <- smoted.cars_train
cars_test.nb <- cars_test[,-c(10,11)]

iii. Separate out the Target labels


cars_trainlabels.nb <- smoted.cars_train.nb[,9]
cars_testlabels <- cars_test.nb[,9]

# Remove the Target labels in Train and test

#smoted.cars_train.nb <- smoted.cars_train.nb[,-9]


#cars_test.nb <- cars_test.nb[,-9]

iv. Train Naive Bayes model


Let us now train the model using Naive Bayes.
library(e1071)

nb1 <- naiveBayes(smoted.cars_train.nb[,-9], smoted.cars_train.nb[,9])
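Before predicting, we can inspect what the model has learned. For a numeric predictor, e1071 stores the class-conditional mean and standard deviation; for example:

# Peek at the learned class-conditional distribution of Distance
nb1$tables$Distance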

v. Predict using Naive Bayes model - Training data


smoted.cars_train.nb$pred_nb1 <- predict(nb1, smoted.cars_train.nb[,-9])

# Confusion Matrix
confusionMatrix(smoted.cars_train.nb$pred_nb1,smoted.cars_train.nb$Transport_car)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 158 21
## Yes 8 127
## Accuracy : 0.9076
## 95% CI : (0.8701, 0.9373)
## No Information Rate : 0.5287
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.8138
##
## Mcnemar's Test P-Value : 0.02586
##
## Sensitivity : 0.9518
## Specificity : 0.8581
## Pos Pred Value : 0.8827
## Neg Pred Value : 0.9407
## Prevalence : 0.5287
## Detection Rate : 0.5032
## Detection Prevalence : 0.5701
## Balanced Accuracy : 0.9050
##
## 'Positive' Class : No

vi. Predict using Naive Bayes model - Test data


cars_test.nb$pred_nb1 <- predict(nb1, cars_test.nb[,-9])

# Confusion Matrix for Naive Bayes


confusionMatrix(cars_test.nb$pred_nb1,cars_test.nb$Transport_car)

## Confusion Matrix and Statistics


## Reference
## Prediction No Yes
## No 113 3
## Yes 1 21
## Accuracy : 0.971
## 95% CI : (0.9274, 0.992)
## No Information Rate : 0.8261
## P-Value [Acc > NIR] : 0.0000001165
##
## Kappa : 0.8957
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9912
## Specificity : 0.8750
## Pos Pred Value : 0.9741
## Neg Pred Value : 0.9545
## Prevalence : 0.8261
## Detection Rate : 0.8188
## Detection Prevalence : 0.8406
## Balanced Accuracy : 0.9331
## 'Positive' Class : No

Interpretation
By using Naive Bayes, we have predicted the employees (on test data) who will be travelling by car with
an accuracy of over 97% which seems very good.
d. Confusion Matrix Interpretation
In the above sections, we created prediction models using logistic regression, KNN and Naive Bayes.
Let us look at the confusion matrices of the three models on the test data and interpret them. The
underlying calls are:

# Confusion matrix for model m2 on test data
confusionMatrix(cars_test$class_m2, cars_test$Transport_car)

# Confusion matrix for KNN
confusionMatrix(cars_testlabels_pred, cars_testlabels)

# Confusion matrix for Naive Bayes
confusionMatrix(cars_test.nb$pred_nb1, cars_test.nb$Transport_car)

The resulting test-set statistics, side by side ('Positive' class = No throughout):

Measure                        Logistic (m2)      KNN                Naive Bayes
Prediction No  (ref No / Yes)  112 / 1            104 / 3            113 / 3
Prediction Yes (ref No / Yes)  2 / 23             10 / 21            1 / 21
Accuracy                       0.9783             0.9058             0.9710
95% CI                         (0.9378, 0.9955)   (0.8443, 0.9489)   (0.9274, 0.9920)
No Information Rate            0.8261             0.8261             0.8261
P-Value [Acc > NIR]            0.00000001576      0.006168           0.0000001165
Kappa                          0.9256             0.7060             0.8957
McNemar's Test P-Value         1                  0.096092           0.6171
Sensitivity                    0.9825             0.9123             0.9912
Specificity                    0.9583             0.8750             0.8750
Pos Pred Value                 0.9912             0.9720             0.9741
Neg Pred Value                 0.9200             0.6774             0.9545
Detection Rate                 0.8116             0.7536             0.8188
Detection Prevalence           0.8188             0.7754             0.8406
Balanced Accuracy              0.9704             0.8936             0.9331

Interpretation: Logistic regression predicts the employees (on test data) who will travel by car with
an accuracy close to 98%, which is very good; Naive Bayes achieves over 97%, also very good; KNN
achieves over 90%, which is acceptable.
e. Remarks on model validation
In section (3.d), we created the confusion matrices for the three models, i.e. logistic regression,
KNN and Naive Bayes. Let us now compare the three models based on accuracy, sensitivity and
specificity.
1) Accuracy: The logistic regression model performs best, with an accuracy of close to 98%. Naïve
Bayes also fares well with an accuracy of over 97%, and KNN fares satisfactorily with an accuracy of
over 90%.
2) Sensitivity: The Naive Bayes model performs best, with a sensitivity of over 99%. Logistic
regression also fares well with a sensitivity of over 98%, and KNN fares satisfactorily with a
sensitivity of over 91%.
3) Specificity: The logistic regression model performs best, with a specificity of over 95%. Naive
Bayes and KNN both fare satisfactorily with a specificity of 87.5%.
Overall Verdict: Based on the above scores, the logistic regression model has worked best for this
dataset in comparison to the others.
In fact, Naive Bayes and logistic regression match up almost equally in their predictions. One must
acknowledge the effect of SMOTE in creating the training dataset: had SMOTE not been used, the
results and their interpretation would have been quite different.

4) Prediction using Bagging and Boosting techniques
a. Bagging
i. Datasets for Bagging
# Training Data for bagging
smoted.cars_train.bagging <- smoted.cars_train

# Test Data for Bagging


cars_test.bagging <- cars_test

ii. Train model using Bagging


library(ipred)
library(rpart)

# Bagging model
cars.bagging <- bagging(Transport_car ~., data=smoted.cars_train.bagging,
control=rpart.control(maxdepth=5, minsplit=3), coob = TRUE)

cars.bagging

##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = Transport_car ~ ., data =
smoted.cars_train.bagging,
## control = rpart.control(maxdepth = 5, minsplit = 3), coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.0287
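An out-of-bag misclassification error of 0.0287 corresponds to an expected accuracy of roughly 97.1% on unseen data, which the test results further below confirm.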

iii. Predict using Training data


smoted.cars_train.bagging$pred.class <- predict(cars.bagging, smoted.cars_train.bagging)

# Confusion Matrix
confusionMatrix(smoted.cars_train.bagging$Transport_car,
                smoted.cars_train.bagging$pred.class)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 161 5
## Yes 0 148
##
## Accuracy : 0.9841
## 95% CI : (0.9632, 0.9948)
## No Information Rate : 0.5127
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.9681
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 1.0000
## Specificity : 0.9673
## Pos Pred Value : 0.9699
## Neg Pred Value : 1.0000
## Prevalence : 0.5127
## Detection Rate : 0.5127
## Detection Prevalence : 0.5287
## Balanced Accuracy : 0.9837
##
## 'Positive' Class : No
##

The accuracy of the bagging model on the Training data is 98.4%.


iv. Predict using Testing data
cars_test.bagging$pred.class <- predict(cars.bagging, cars_test.bagging)

# Confusion Matrix
confusionMatrix(cars_test.bagging$Transport_car,cars_test.bagging$pred.class)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 112 2
## Yes 2 22
##
## Accuracy : 0.971
## 95% CI : (0.9274, 0.992)
## No Information Rate : 0.8261
## P-Value [Acc > NIR] : 0.0000001165
##
## Kappa : 0.8991
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9825
## Specificity : 0.9167
## Pos Pred Value : 0.9825
## Neg Pred Value : 0.9167
## Prevalence : 0.8261
## Detection Rate : 0.8116
## Detection Prevalence : 0.8261
## Balanced Accuracy : 0.9496
##
## 'Positive' Class : No
##

The accuracy of the bagging model on the test data is 97.1%, which is also very good.

b. Boosting - using GBM
i. Datasets for Boosting
# Training Data for boosting
smoted.cars_train.boosting <- smoted.cars_train
smoted.cars_train.boosting$Transport_car <-
ifelse(smoted.cars_train.boosting$Transport_car == "Yes",1,0)

# Test Data for Boosting


cars_test.boosting <- cars_test
cars_test.boosting$Transport_car <- ifelse(cars_test.boosting$Transport_car == "Yes",1,0)

ii. Train the boosting model


library(gbm)

## Warning: package 'gbm' was built under R version 3.6.2

## Loaded gbm 2.1.5

# smoted.cars_train.boosting$Transport_car <- as.factor(smoted.cars_train.boosting$Transport_car)
#?gbm()
cars.gbm <- gbm(
  formula = Transport_car ~ .,
  distribution = "bernoulli", # bernoulli because we are fitting a logistic model and want probabilities
  data = smoted.cars_train.boosting,
  n.trees = 10000,            # the number of stumps
  interaction.depth = 1,      # number of splits per tree (starting from a single node)
  shrinkage = 0.001,          # shrinks the impact of each additional fitted base-learner (tree)
  cv.folds = 5,               # cross-validation folds
  n.cores = NULL,             # will use all cores by default
  verbose = FALSE             # if TRUE, shows the error after every tree/stump
)
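Having fit 10,000 trees with 5-fold cross-validation, we could optionally let gbm pick the iteration with the lowest CV error instead of scoring with (nearly) all trees; a minimal sketch:

# Optional sketch: optimal number of boosting iterations by CV error
best_iter <- gbm.perf(cars.gbm, method = "cv")   # also plots the error curves
best_iter
# subsequent predict() calls could then pass n.trees = best_iter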

iii. Predict on Training Data


smoted.cars_train.boosting$pred.class <- predict(cars.gbm, smoted.cars_train.boosting,
type = "response")

## Using 9998 trees...

# use type = "response" (as in logistic regression), otherwise we get log-odds

smoted.cars_train.boosting$pred.class <- floor(smoted.cars_train.boosting$pred.class+0.5)

smoted.cars_train.boosting$Transport_car <-
as.factor(smoted.cars_train.boosting$Transport_car)
smoted.cars_train.boosting$pred.class <- as.factor(smoted.cars_train.boosting$pred.class)
confusionMatrix(smoted.cars_train.boosting$Transport_car,
                smoted.cars_train.boosting$pred.class)

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 161 5
## 1 0 148
##
## Accuracy : 0.9841
## 95% CI : (0.9632, 0.9948)
## No Information Rate : 0.5127
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.9681
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 1.0000
## Specificity : 0.9673
## Pos Pred Value : 0.9699
## Neg Pred Value : 1.0000
## Prevalence : 0.5127
## Detection Rate : 0.5127
## Detection Prevalence : 0.5287
## Balanced Accuracy : 0.9837
##
## 'Positive' Class : 0
##

The accuracy with gbm boosting is 98.4% on training data.


iv. Predict on Test Data
cars_test.boosting$pred.class <- predict(cars.gbm, cars_test.boosting, type = "response")

## Using 9998 trees...

# use type = "response" (as in logistic regression), otherwise we get log-odds

cars_test.boosting$pred.class <- floor(cars_test.boosting$pred.class+0.5)

cars_test.boosting$Transport_car <- as.factor(cars_test.boosting$Transport_car)


cars_test.boosting$pred.class <- as.factor(cars_test.boosting$pred.class)

confusionMatrix(cars_test.boosting$Transport_car,cars_test.boosting$pred.class)

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 113 1
## 1 2 22
##
## Accuracy : 0.9783
## 95% CI : (0.9378, 0.9955)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.00000004537
##
## Kappa : 0.9231
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9826
## Specificity : 0.9565
## Pos Pred Value : 0.9912
## Neg Pred Value : 0.9167
## Prevalence : 0.8333
## Detection Rate : 0.8188
## Detection Prevalence : 0.8261
## Balanced Accuracy : 0.9696
##
## 'Positive' Class : 0
##

The accuracy using boosting method on test data is 97.83% which is very good.

c. Overall Best Model


The table below provides a comparative view of the confusion matrix measures (accuracy, sensitivity
and specificity) of all the models, using the test data.

Model/Measure     Logistic (M2)   KNN     Naïve Bayes   Bagging   Boosting
Accuracy (%)      97.83           90.58   97.10         97.10     97.83
Sensitivity (%)   98.25           91.23   99.12         98.25     98.26
Specificity (%)   95.83           87.50   87.50         91.67     95.65

Comparing the accuracy, sensitivity and specificity measures of the classification matrices (on test
data) of all the models created so far, we can say that the boosting and logistic regression
algorithms have worked best for this dataset.
However, one must consider the fact that we balanced the target variable responder class using
SMOTE. The results would have been different had we used the dataset as-is.

5) Actionable Insights and Recommendations
For this section, we refer back to section 3.a, where we derived the logistic regression model,
identified the significant variables and examined the odds explanatory power. Considering the
significant variables, the logistic model equation is:

Log (odds of Car Transport) = -39.59 + 1.15 (Age) – 1.65 (MBA) + 0.35 (Distance) + 1.48 (License)

The odds explanatory power is provided below:


## odds_m2 probability varImp
## (Intercept) 0.00 0.00 0.00
## Age 2.82 73.80 32.50
## MBAMBA 0.14 12.63 1.67
## Distance 1.41 58.50 16.27
## licenseLicensed 4.09 80.36 47.20

The final interpretation is as follows:

(The objective and assumption here is that we are an automobile company looking for a strategy to
promote sales at minimized sales and marketing expense, and that we are privy to the employees’
personal information, as mentioned above, through their employer.)
1) Age: For each additional year of age, the odds of the employee commuting by car are multiplied
by 2.82 (an implied probability of 73.80%). As we saw in the EDA, the mean age of an employee using
car transport is around 35 years, with a 1st quartile of around 32 years.
2) MBA: As per the data provided, the odds of an MBA employee travelling by car are only 0.14 times
those of a non-MBA (an implied probability of 12.63%). We also saw in the EDA that more non-MBAs
travel by car than MBAs.
3) Distance: For each additional kilometre of distance, the odds of the employee commuting by car
are multiplied by 1.41 (an implied probability of 58.50%). As we saw in the EDA, the mean commuting
distance of employees who travel by car is around 15 km, with a 1st quartile of around 12 km.
4) License: An employee with a valid driving license has 4.09 times the odds of travelling by car
compared to an unlicensed employee (an implied probability of 80.36%).
5) Based on the above, the automobile company can target employees matching the profile below:

Measure               Criteria
Age                   Equal to or above 32 years
Qualification         Non-MBA
Distance to commute   Equal to or greater than 12 km
License               Valid driving license available
6) If such employees are not already commuting by car, there is a high probability that they are
considering a move to car transport, and the automobile company should target them.
7) If such employees are already commuting by car, the automobile company can check whether they
would be interested in a vehicle upgrade, offering buy-back schemes so that these employees get a
good deal.
