
Great Lakes Institute of Management, Chennai

PGPM 2019-2020

Business Analytics Project

Topic: Predicting Income levels using demographics

Group-1-Section-1

Aakar Gangrade FT201001
Arpit Agarwal FT201019
Avishek Sharma FT201026
Abhishek Kumar Singh FT201007
Disha Ghoshal FT201032
Anand S FT201013
Introduction:
The annual income of an adult depends on many factors. In this study we use
personal demographic data such as age, education level, gender, occupation and
other variables to predict the income class to which a person belongs. The
predictions are made with a decision tree and a logistic regression model built
on the selected dataset, using its attributes as predictor variables.

Data Preparation:
• Data Source:

The data is the Adult income dataset from the UCI Machine Learning Repository.
It is based on US Census Bureau data and was compiled during the year 1996. We
took the data from www.kaggle.com. Please find the dataset link below.
https://www.kaggle.com/flyingwombat/logistic-regression-with-uci-adult-income/data

• Data Description:

The dataset has 15 columns. Fourteen attributes describe the demographics and
personal characteristics of a person; these are the independent variables. The
dependent variable is the income class of the person.
INDEPENDENT VARIABLES:

o Age
o Work class – organization of the employer
o Fnlwgt – Final weight
o Education – level of education of the individual
o Education-num – Score given to education level (categorical to
continuous)
o Marital status
o Occupation
o Relationship
o Race
o Gender
o Capital-gain
o Capital-loss
o Hours-per-week
o Native-country
DEPENDENT VARIABLE:

o Income class

• Data Understanding:
o Data Volume – The dataset contains 48842 rows, some of which contain null
values or unknown values.
o Attributes – There are 6 continuous variables and 8 nominal variables. The
continuous variables are:
• fnlwgt: continuous
• age: continuous
• education-num: continuous
• capital-gain: continuous
• capital-loss: continuous
• hours-per-week: continuous

o The dataset is split into training and test datasets according to the
split factor given as input.
o The model is built using the training data and validated using the test
data.
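The split described above can be sketched as follows. This is only an illustration: the helper `split_data()`, its arguments and the toy data frame are assumptions made for the example and are not part of the project code.

```r
# Hypothetical helper: split a data frame into training and test sets.
# train_frac is the split factor (fraction of rows used for training).
split_data <- function(df, train_frac = 0.7, seed = 1) {
  set.seed(seed)                                   # make the split reproducible
  n_train <- round(nrow(df) * train_frac)
  train_idx <- sample(seq_len(nrow(df)), n_train)  # row indices, no replacement
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Toy data standing in for the income dataset
toy <- data.frame(id = 1:10, x = seq(0.1, 1, by = 0.1))
parts <- split_data(toy, train_frac = 0.7)
nrow(parts$train)
nrow(parts$test)
```

Fixing the seed inside the helper means the same rows land in the training set on every run, which keeps later accuracy figures comparable.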
• Data Dictionary:
The attribute information below lists the values that each column in the
dataset can hold.

o age: continuous.
o workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov,
Local-gov, State-gov, Without-pay, Never-worked.
o fnlwgt: continuous.
o education: Bachelors, Some-college, 11th, HS-grad, Prof-school,
Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th,
Doctorate, 5th-6th, Preschool.
o education-num: continuous.
o marital-status: Married-civ-spouse, Divorced, Never-married,
Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
o occupation: Tech-support, Craft-repair, Other-service, Sales,
Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct,
Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
Protective-serv, Armed-Forces.
o relationship: Wife, Own-child, Husband, Not-in-family,
Other-relative, Unmarried.
o race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
o sex: Female, Male.
o capital-gain: continuous.
o capital-loss: continuous.
o hours-per-week: continuous.
o native-country: United-States, Cambodia, England, Puerto-Rico,
Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan,
Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy,
Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia,
El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
o Income class: >50K, <=50K

• Business Understanding:

Personal characteristics and demographic factors are expected to have a
significant impact on an individual’s income, assuming the person is an
adult. The dataset splits income into two classes according to the value of
the income column: greater than 50K and less than or equal to 50K.

DATA PREPARATION
Some values in the dataset are marked as “?”; we treated them as “NA”.
Several factors have too many levels, so we merged some of them:

Merging WORKING CLASS

Merging MARITAL STATUS
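A minimal sketch of such a merge, shown on a toy vector rather than on the real `income$workclass` column. The groupings follow the ones used in the final code of this report; the names `workclass` and `merge_map` are assumptions made for the example.

```r
# Toy vector standing in for income$workclass
workclass <- c("Private", "State-gov", "Local-gov", "Self-emp-inc",
               "Self-emp-not-inc", "Without-pay", "Never-worked", "Private")

# Lookup table: original level -> merged level
merge_map <- c("State-gov"        = "SL-gov",
               "Local-gov"        = "SL-gov",
               "Self-emp-inc"     = "Self-employed",
               "Self-emp-not-inc" = "Self-employed",
               "Without-pay"      = "Unemployed",
               "Never-worked"     = "Unemployed")

# Replace only the values found in the lookup; leave the rest untouched
to_merge <- workclass %in% names(merge_map)
workclass[to_merge] <- merge_map[workclass[to_merge]]

# Convert to a factor once the merging is done
workclass <- as.factor(workclass)
table(workclass)
```

Keeping the mapping in one named vector makes it easy to review which levels were combined, compared with a chain of separate assignments.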

Changing the datatype of variables to “factor” and “numeric” wherever necessary:
income$native.country <- as.factor(income$native.country)
income$marital.status <- as.factor(income$marital.status)
income$workclass <- as.factor(income$workclass)
str(income)
Checking the structure of “income” again:

Checking and removing the “NA” values

We found 3620 rows containing NA values, which is about 7.4% of the 48842
rows (3620/48842).

After omitting these rows, there are no “NA” values left in the observations.
Hence, we can go ahead with splitting the data into “test.df” and “train.df”.
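The NA check can be sketched on a small example. The CSV text below is made up for illustration and only mimics the layout of the income file; the variable names are assumptions.

```r
# Toy CSV text standing in for income.csv; "?" marks an unknown value
csv_text <- "age,workclass,occupation
39,State-gov,Adm-clerical
50,?,Exec-managerial
38,Private,?
53,Private,Handlers-cleaners"

# na.strings tells read.csv which tokens to treat as missing
toy <- read.csv(text = csv_text, na.strings = c("", " ", "NA", "?"))

colSums(is.na(toy))                        # NA count per column
n_incomplete <- sum(!complete.cases(toy))  # rows with at least one NA
100 * n_incomplete / nrow(toy)             # share of affected rows, in percent

toy_clean <- na.omit(toy)                  # drop every row that contains an NA
anyNA(toy_clean)
```

Counting incomplete rows with `complete.cases()` gives the number of observations that `na.omit()` will remove, which is the figure that matters for the remaining sample size.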

We partitioned the “income” data 30-70 into “test.df” (30%) and “train.df” (70%).
Model Selection:
1. Decision Tree
2. Logistic Regression Model

1. Decision Tree

#classification tree
default.ct = rpart(income~., data = train.df, method = "class")

#plot tree
prp(default.ct, type = 5, cex=0.5)

Now, we allow the tree to grow fully:

deeper.ct=rpart(income~., data = train.df, method="class", cp=0, minsplit=1)

prp(deeper.ct,type=5, cex = 0.5)

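The accuracy of the generated model:

Now, checking the accuracy using “testing” data:

For the fully grown tree the accuracy is close to 100%. A tree grown with cp = 0 and
minsplit = 1 keeps splitting until it fits the training data almost perfectly, so this figure
mainly shows how closely the tree follows that data. On its own it is not evidence of an
accurate model, which is why the tree is pruned using cross validation.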
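The gap between training and test accuracy for a fully grown tree can be checked with a short sketch. It uses the built-in iris data purely as a stand-in for the income data, and the helper `accuracy()` is an assumption made for the example.

```r
library(rpart)

set.seed(1)
# iris is only a stand-in dataset; it ships with base R
train_idx <- sample(seq_len(nrow(iris)), round(0.7 * nrow(iris)))
toy_train <- iris[train_idx, ]
toy_test  <- iris[-train_idx, ]

# Fully grown tree: no complexity penalty, split down to single observations
full_tree <- rpart(Species ~ ., data = toy_train, method = "class",
                   cp = 0, minsplit = 1)

# Hypothetical helper: share of correctly classified rows
accuracy <- function(model, data) {
  mean(predict(model, data, type = "class") == data$Species)
}

train_acc <- accuracy(full_tree, toy_train)
test_acc  <- accuracy(full_tree, toy_test)
c(train = train_acc, test = test_acc)
```

Comparing the two numbers side by side is a quick way to see whether a high accuracy comes from the model or from memorising the training rows.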
Performing cross validation:
Taking the cp value with the minimum cross-validated error (xerror) to generate a “Best Pruned Tree”:
cv.ct = rpart(income~., data = train.df, method="class", cp=0.00001, minsplit=5, xval=5)

printcp(cv.ct)

min.cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"]

#best pruned tree


pruned.ct = prune(cv.ct, cp = min.cp)
prp(pruned.ct,type=5, cex = 0.5)

Checking the important variables

> varImp(cv.ct)
                  Overall
age             2652.9003
capital.gain    3794.0838
capital.loss    1161.7316
education       3881.9123
educational.num 3583.1706
fnlwgt          2503.3892
gender           246.7625
hours.per.week  1879.5588
marital.status  2555.8307
native.country   720.3707
occupation      2956.7303
race             284.5883
relationship    2676.5847
workclass        980.2665
> varImp(deeper.ct)
                  Overall
age             3615.3792
capital.gain    3962.4568
capital.loss    1263.6892
education       4380.2679
educational.num 3973.0590
fnlwgt          3934.1958
gender           396.5312
hours.per.week  2597.3727
marital.status  2704.1844
native.country   926.8374
occupation      3454.9851
race             556.2974
relationship    2918.5779
workclass       1394.9231

Logistic Regression:
Performing logistic regression using the important variables we found from the Decision Tree.

> with(training_logit, pchisq(null.deviance - deviance,
+ df.null - df.residual, lower.tail = FALSE))
[1] 0
> with(training_logit,1-(deviance/null.deviance))
[1] 0.4185798
test.df$predicted_val<-predict(training_logit, newdata=test.df,type="response")
#label the predictions with the income levels so that they match test.df$income
income.levels<-levels(as.factor(train.df$income))
test.df$score<-factor(ifelse(test.df$predicted_val>=0.5,income.levels[2],income.levels[1]),levels=income.levels)
confusionMatrix(test.df$score,factor(test.df$income,levels=income.levels))
varImp(training_logit)
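The two deviance-based quantities computed above can be reproduced on a small built-in dataset. The sketch below uses mtcars as a stand-in, with the 0/1 column `am` playing the role of the income class; the object names are assumptions made for the example.

```r
# mtcars is a stand-in dataset; am (0/1) plays the role of the income class
toy_logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Likelihood-ratio (deviance) test against the intercept-only model
lr_stat <- toy_logit$null.deviance - toy_logit$deviance
lr_df   <- toy_logit$df.null - toy_logit$df.residual
p_value <- pchisq(lr_stat, lr_df, lower.tail = FALSE)

# Share of the null deviance removed by the predictors (a pseudo R-squared)
pseudo_r2 <- 1 - toy_logit$deviance / toy_logit$null.deviance

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value, pseudo_r2 = pseudo_r2)
```

A small p-value says the predictors improve on the intercept-only model; the second quantity measures how much of the null deviance they remove, which is not the same as explained variance in a linear regression.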
> summary(training_logit)

Call:
glm(formula = income ~ ., family = binomial(link = "logit"),
    data = train.df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.1152  -0.5163  -0.1940  -0.0212   3.7828

Coefficients: (1 not defined because of singularities)
                                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                              -6.375e+00  8.775e-01  -7.265 3.73e-13 ***
age                                       2.458e-02  1.663e-03  14.781  < 2e-16 ***
workclassLocal-gov                       -7.271e-01  1.112e-01  -6.540 6.14e-11 ***
workclassPrivate                         -5.031e-01  9.256e-02  -5.436 5.46e-08 ***
workclassSelf-emp-inc                    -3.528e-01  1.207e-01  -2.923 0.003469 **
workclassSelf-emp-not-inc                -9.883e-01  1.080e-01  -9.149  < 2e-16 ***
workclassState-gov                       -8.390e-01  1.218e-01  -6.886 5.75e-12 ***
workclassWithout-pay                     -1.038e+00  8.095e-01  -1.282 0.199834
fnlwgt                                    7.931e-07  1.697e-07   4.674 2.96e-06 ***
education11th                             2.207e-01  2.176e-01   1.014 0.310423
education12th                             6.860e-01  2.741e-01   2.502 0.012332 *
education1st-4th                         -1.867e-01  4.787e-01  -0.390 0.696484
education5th-6th                         -2.097e-02  3.266e-01  -0.064 0.948805
education7th-8th                         -2.721e-01  2.404e-01  -1.132 0.257611
education9th                             -9.970e-02  2.670e-01  -0.373 0.708893
educationAssoc-acdm                       1.578e+00  1.869e-01   8.442  < 2e-16 ***
educationAssoc-voc                        1.472e+00  1.805e-01   8.152 3.58e-16 ***
educationBachelors                        2.124e+00  1.695e-01  12.525  < 2e-16 ***
educationDoctorate                        2.995e+00  2.233e-01  13.413  < 2e-16 ***
educationHS-grad                          9.981e-01  1.657e-01   6.023 1.71e-09 ***
educationMasters                          2.406e+00  1.792e-01  13.429  < 2e-16 ***
educationPreschool                       -9.705e-01  1.166e+00  -0.833 0.405079
educationProf-school                      2.881e+00  2.114e-01  13.629  < 2e-16 ***
educationSome-college                     1.368e+00  1.678e-01   8.152 3.59e-16 ***
educational.num                                  NA         NA      NA       NA
marital.statusMarried-AF-spouse           2.401e+00  5.973e-01   4.020 5.81e-05 ***
marital.statusMarried-civ-spouse          2.347e+00  2.588e-01   9.069  < 2e-16 ***
marital.statusMarried-spouse-absent       2.255e-01  2.157e-01   1.045 0.295868
marital.statusNever-married              -4.788e-01  8.697e-02  -5.505 3.69e-08 ***
marital.statusSeparated                   3.136e-02  1.573e-01   0.199 0.841954
marital.statusWidowed                     1.623e-01  1.519e-01   1.069 0.285173
occupationArmed-Forces                   -1.756e+00  2.462e+00  -0.713 0.475857
occupationCraft-repair                    8.220e-02  7.887e-02   1.042 0.297265
occupationExec-managerial                 8.398e-01  7.576e-02  11.085  < 2e-16 ***
occupationFarming-fishing                -9.109e-01  1.383e-01  -6.585 4.54e-11 ***
occupationHandlers-cleaners              -7.343e-01  1.398e-01  -5.254 1.49e-07 ***
occupationMachine-op-inspct              -2.762e-01  9.999e-02  -2.762 0.005738 **
occupationOther-service                  -8.655e-01  1.165e-01  -7.428 1.10e-13 ***
occupationPriv-house-serv                -2.514e+00  1.022e+00  -2.460 0.013889 *
occupationProf-specialty                  5.639e-01  8.001e-02   7.049 1.81e-12 ***
occupationProtective-serv                 6.079e-01  1.242e-01   4.895 9.82e-07 ***
occupationSales                           2.946e-01  8.098e-02   3.638 0.000275 ***
occupationTech-support                    6.036e-01  1.075e-01   5.616 1.96e-08 ***
occupationTransport-moving               -1.271e-01  9.780e-02  -1.300 0.193612
relationshipNot-in-family                 6.038e-01  2.553e-01   2.365 0.018042 *
relationshipOther-relative               -2.881e-01  2.338e-01  -1.232 0.217846
relationshipOwn-child                    -5.527e-01  2.526e-01  -2.188 0.028690 *
relationshipUnmarried                     4.228e-01  2.719e-01   1.555 0.119927
relationshipWife                          1.093e+00  1.019e-01  10.730  < 2e-16 ***
raceAsian-Pac-Islander                    7.813e-01  2.714e-01   2.879 0.003994 **
raceBlack                                 2.201e-01  2.223e-01   0.990 0.322101
raceOther                                 2.691e-01  3.397e-01   0.792 0.428334
raceWhite                                 4.732e-01  2.098e-01   2.256 0.024067 *
genderMale                                6.860e-01  7.728e-02   8.877  < 2e-16 ***
capital.gain                              3.231e-04  1.054e-05  30.641  < 2e-16 ***
capital.loss                              6.322e-04  3.669e-05  17.231  < 2e-16 ***
hours.per.week                            2.652e-02  1.617e-03  16.403  < 2e-16 ***
native.countryCanada                     -4.571e-01  8.106e-01  -0.564 0.572807
native.countryChina                      -1.983e+00  8.179e-01  -2.425 0.015320 *
native.countryColumbia                   -3.884e+00  1.152e+00  -3.372 0.000747 ***
native.countryCuba                       -9.062e-01  8.255e-01  -1.098 0.272300
native.countryDominican-Republic         -2.158e+00  1.003e+00  -2.152 0.031425 *
native.countryEcuador                    -1.334e+00  9.739e-01  -1.369 0.170867
native.countryEl-Salvador                -2.018e+00  9.734e-01  -2.073 0.038152 *
native.countryEngland                    -5.121e-01  8.294e-01  -0.617 0.536905
native.countryFrance                     -6.338e-01  9.044e-01  -0.701 0.483430
native.countryGermany                    -9.552e-01  8.069e-01  -1.184 0.236528
native.countryGreece                     -9.875e-01  8.769e-01  -1.126 0.260141
native.countryGuatemala                  -1.421e+00  1.077e+00  -1.319 0.187074
native.countryHaiti                      -6.606e-01  9.690e-01  -0.682 0.495400
native.countryHoland-Netherlands         -1.075e+01  5.354e+02  -0.020 0.983982
native.countryHonduras                   -7.202e-01  1.456e+00  -0.495 0.620762
native.countryHong                       -1.744e+00  1.005e+00  -1.737 0.082453 .
native.countryHungary                    -1.121e+00  1.032e+00  -1.086 0.277350
native.countryIndia                      -1.619e+00  7.983e-01  -2.028 0.042519 *
native.countryIran                       -1.287e+00  8.874e-01  -1.450 0.146930
native.countryIreland                    -3.837e-01  9.341e-01  -0.411 0.681261
native.countryItaly                      -1.641e-01  8.304e-01  -0.198 0.843331
native.countryJamaica                    -6.553e-01  8.969e-01  -0.731 0.465017
native.countryJapan                      -1.596e+00  8.475e-01  -1.883 0.059757 .
native.countryLaos                       -2.538e+00  1.381e+00  -1.838 0.066075 .
native.countryMexico                     -1.697e+00  7.940e-01  -2.138 0.032526 *
native.countryNicaragua                  -2.621e+00  1.317e+00  -1.990 0.046637 *
native.countryOutlying-US(Guam-USVI-etc) -1.231e+01  1.153e+02  -0.107 0.914944
native.countryPeru                       -2.001e+00  1.028e+00  -1.947 0.051538 .
native.countryPhilippines                -8.945e-01  7.798e-01  -1.147 0.251314
native.countryPoland                     -9.133e-01  8.569e-01  -1.066 0.286499
native.countryPortugal                   -2.347e-01  8.923e-01  -0.263 0.792561
native.countryPuerto-Rico                -1.521e+00  8.551e-01  -1.779 0.075216 .
native.countryScotland                   -2.198e+00  1.142e+00  -1.925 0.054214 .
native.countrySouth                      -2.726e+00  8.614e-01  -3.165 0.001552 **
native.countryTaiwan                     -1.208e+00  8.646e-01  -1.397 0.162470
native.countryThailand                   -2.072e+00  1.085e+00  -1.910 0.056154 .
native.countryTrinadad&Tobago            -2.783e+00  1.354e+00  -2.056 0.039771 *
native.countryUnited-States              -1.007e+00  7.678e-01  -1.311 0.189888
native.countryVietnam                    -1.952e+00  8.935e-01  -2.184 0.028940 *
native.countryYugoslavia                 -4.707e-01  1.003e+00  -0.469 0.638982
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

m1 = stepAIC(training_logit)
Start: AIC=20797.94
income ~ age + workclass + fnlwgt + education + educational.num +
marital.status + occupation + relationship + race + gender +
capital.gain + capital.loss + hours.per.week + native.country

Step: AIC=20797.94
income ~ age + workclass + fnlwgt + education + marital.status +
occupation + relationship + race + gender + capital.gain +
capital.loss + hours.per.week + native.country

                 Df Deviance   AIC
<none>                 20606 20798
- race            4    20626 20810
- fnlwgt          1    20628 20818
- native.country 40    20715 20827
- gender          1    20686 20876
- workclass       6    20732 20912
- marital.status  6    20742 20922
- relationship    5    20822 21004
- age             1    20826 21016
- hours.per.week  1    20879 21069
- capital.loss    1    20911 21101
- occupation     13    21238 21404
- education      15    21593 21755
- capital.gain    1    22428 22618
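The Deviance and AIC columns in the table above are linked: for a logistic regression on a two-class response, AIC equals the residual deviance plus twice the number of estimated coefficients. In the <none> row, 20798 - 20606 = 192 = 2 x 96, which is consistent with 96 estimated coefficients. A minimal sketch of the same relation on the built-in mtcars data (a stand-in, not the income data; object names are assumptions):

```r
# mtcars is a stand-in dataset; am (0/1) is the two-class response
toy_full <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# AIC by hand: residual deviance + 2 * number of estimated coefficients
k <- length(coef(toy_full))
aic_by_hand <- deviance(toy_full) + 2 * k

c(by_hand = aic_by_hand, from_R = AIC(toy_full))

# Single-term deletions, the same kind of table that stepAIC prints at each step
drop1(toy_full)
```

Each row of the deletion table is the model with one term removed, so a term is worth keeping when its row shows a higher AIC than the <none> row.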
Interpretations
• From the Decision Tree model, we found that all the variables contribute to
predicting the income level: each one has a non-zero variable importance.
• Accuracy is around 83.92%. Balanced accuracy is around 84.31%.
• From the Logistic Regression model, we found that stepwise AIC selection
keeps all the predictors: dropping any one of them increases the AIC. Within
the factors, many individual levels are significant at alpha values of 0.01,
0.05 or 0.1, while others (for example many native.country levels) are not.
• About 41.85% of the null deviance is explained by the predictor variables
(1 - deviance/null.deviance = 0.4186). This is a pseudo R-squared, not a share
of variance.
• The AIC of the model is 20798. AIC is only meaningful when models fitted to
the same data are compared, where a lower value is better. Here it is lower
than the AIC of every model obtained by dropping a single predictor.
• Logistic AUC is 90.69%, which shows that the model separates the two income
classes well. In the ROC graph, the curve lies above the diagonal line of the
base model.
• By Logistic Regression we identified the following 13 important variables:
age, workclass, fnlwgt, education, marital.status, occupation, relationship,
race, gender, capital.gain, capital.loss, hours.per.week, native.country
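Accuracy and balanced accuracy are both derived from the confusion matrix. A minimal sketch of the two calculations follows; the counts are hypothetical and are not the project's results.

```r
# Hypothetical confusion-matrix counts (not the project's results)
tp <- 30;  fn <- 20   # actual >50K:  predicted >50K / predicted <=50K
tn <- 140; fp <- 10   # actual <=50K: predicted <=50K / predicted >50K

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of the >50K class found
specificity <- tn / (tn + fp)   # share of the <=50K class found

# Balanced accuracy weights both classes equally, whatever their sizes
balanced_accuracy <- (sensitivity + specificity) / 2

c(accuracy = accuracy, balanced_accuracy = balanced_accuracy)
```

Because the two income classes are not the same size, the two figures can differ noticeably, and balanced accuracy is the one that is not dominated by the larger class.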
Final Code:
setwd("C:/Users/Desktop/dataset")
library(caret)
library(rpart)
library(rpart.plot)
library(MASS)

income = read.csv("income.csv")
str(income)
summary(income)
#preprocess begin
table(income$workclass)

income$workclass <- as.character(income$workclass)

income$workclass[income$workclass == "Without-pay" |
income$workclass == "Never-worked"] <- "Unemployed"

income$workclass[income$workclass == "State-gov" |
income$workclass == "Local-gov"] <- "SL-gov"

income$workclass[income$workclass == "Self-emp-inc" |
income$workclass == "Self-emp-not-inc"] <- "Self-employed"

table(income$workclass)

table(income$marital.status)

income$marital.status <- as.character(income$marital.status)

income$marital.status[income$marital.status == "Married-AF-spouse" |
income$marital.status == "Married-civ-spouse" |
income$marital.status == "Married-spouse-absent"] <- "Married"

income$marital.status[income$marital.status == "Divorced" |
income$marital.status == "Separated" |
income$marital.status == "Widowed"] <- "Not-Married"
table(income$marital.status)

income$native.country <- as.character(income$native.country)

north.america <- c("Canada", "Cuba", "Dominican-Republic", "El-Salvador", "Guatemala",
"Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua",
"Outlying-US(Guam-USVI-etc)", "Puerto-Rico", "Trinadad&Tobago",
"United-States")
asia <- c("Cambodia", "China", "Hong", "India", "Iran", "Japan", "Laos",
"Philippines", "Taiwan", "Thailand", "Vietnam")
south.america <- c("Columbia", "Ecuador", "Peru")
europe <- c("England", "France", "Germany", "Greece", "Holand-Netherlands",
"Hungary", "Ireland", "Italy", "Poland", "Portugal", "Scotland",
"Yugoslavia")
other <- c("South", "?")

income$native.country[income$native.country %in% north.america] <- "North America"
income$native.country[income$native.country %in% asia] <- "Asia"
income$native.country[income$native.country %in% south.america] <- "South America"
income$native.country[income$native.country %in% europe] <- "Europe"
income$native.country[income$native.country %in% other] <- "Other"

table(income$native.country)

income$native.country <- as.factor(income$native.country)
income$marital.status <- as.factor(income$marital.status)
income$workclass <- as.factor(income$workclass)
str(income)

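#re-read the raw file, treating "?" and blanks as NA, and drop incomplete rows
#note: this starts again from income.csv, so the levels merged above are not
#carried into income.data; the models below are fitted on the original levels
#stringsAsFactors = TRUE keeps text columns as factors on R 4.0 and later
income.data = read.csv("income.csv", na.strings = c(""," ","NA","?"), stringsAsFactors = TRUE)

income.data = na.omit(income.data)

str(income.data)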

####End Preprocess
##################################################
############################################################
####Partition Data 70% training

set.seed(77850)
train.index<- sample(c(1:dim(income.data)[1]),dim(income.data)[1]*0.7)
train.df = income.data[train.index,]
test.df = income.data[-train.index,]

#check the dimensions of both partitions
dim(train.df)
dim(test.df)

#### Select relevant variables using a decision tree

library(rpart)

library(rpart.plot)

set.seed(77850)

#classification tree
default.ct = rpart(income~., data = train.df, method = "class")

#plot tree
prp(default.ct, type = 5, cex=0.5)

deeper.ct=rpart(income~., data = train.df, method="class", cp=0, minsplit=1)

prp(deeper.ct,type=5, cex = 0.5)


#predict on the training data with the default tree
default.model = predict(default.ct, train.df, type="class")

library(caret)

#generate confusion matrix for training data
confusionMatrix(default.model,as.factor(train.df$income))

#predict on the test data with the default tree
default.model1 = predict(default.ct, test.df, type = "class")

#generate confusion matrix for test data
confusionMatrix(default.model1,as.factor(test.df$income))
cv.ct = rpart(income~., data = train.df, method="class", cp=0.00001, minsplit=5, xval=5)

printcp(cv.ct)

min.cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"]

#best pruned tree


pruned.ct = prune(cv.ct, cp = min.cp)
prp(pruned.ct,type=5, cex = 0.5)

varImp(cv.ct)

varImp(deeper.ct)

#Logistic Regression
training_logit = glm(income ~ age + workclass + fnlwgt + education + marital.status +
occupation + relationship + race + gender + capital.gain +
capital.loss + hours.per.week + native.country, data = train.df, family =
binomial(link="logit"))

#likelihood-ratio test of the fitted model against the intercept-only model
with(training_logit, pchisq(null.deviance - deviance,
df.null - df.residual, lower.tail = FALSE))

with(training_logit,1-(deviance/null.deviance))

summary(training_logit)

m1 = stepAIC(training_logit)

m1
test.df$predicted_val<-predict(training_logit, newdata=test.df,type="response")
#label the predictions with the income levels so that they match test.df$income
income.levels<-levels(as.factor(train.df$income))
test.df$score<-factor(ifelse(test.df$predicted_val>=0.5,income.levels[2],income.levels[1]),levels=income.levels)
confusionMatrix(test.df$score,factor(test.df$income,levels=income.levels))
varImp(training_logit)
library(ROCR)
pred<-prediction(test.df$predicted_val,test.df$income)
perf <- performance(pred,"tpr","fpr")
plot(perf)
abline(a=0,b=1)
#Create AUC data
aucval<-performance(pred,"auc")
#Calculate AUC
logistic_auc<-as.numeric(aucval@y.values)
#Display the auc value
logistic_auc
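
The AUC reported by ROCR can be cross-checked in base R with the rank-sum (Mann-Whitney) formulation. The sketch below is an illustration only: the helper `auc_rank()` and the toy scores are assumptions and are not part of the project code.

```r
# Hypothetical helper: AUC from scores and 0/1 labels via the rank-sum formula
auc_rank <- function(scores, labels) {
  r  <- rank(scores)        # average ranks handle tied scores
  n1 <- sum(labels == 1)    # number of positives
  n0 <- sum(labels == 0)    # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy predicted probabilities and true classes
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)
toy_auc <- auc_rank(scores, labels)
toy_auc
```

The value is the share of positive-negative pairs in which the positive case receives the higher score, which is why an AUC of 0.5 corresponds to the diagonal line drawn by abline(a=0,b=1).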
