
(1) The data attribute distribution

Of the selected features, "age", "balance", "campaign", "pdays", and "previous" are numerical, while "job", "marital", "education", "default", "housing", "loan", and "poutcome" are categorical.

"deposit" is our target variable; it tells us whether the client subscribed to a term deposit.

Let's explore the categorical variables first. The count plots of the categorical variables (shown in the accompanying image) tell us how the categories of each feature are distributed; a minimal sketch of how such plots can be produced is given below.
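The sketch assumes the dataset has been loaded into a pandas DataFrame named df (a hypothetical name; the loading code is not shown in this report):

import matplotlib.pyplot as plt
import seaborn as sns

# 'df' is assumed to hold the bank-marketing data; column names come from the list above.
cat_cols = ["job", "marital", "education", "default", "housing", "loan", "poutcome"]

fig, axes = plt.subplots(4, 2, figsize=(14, 16))
for ax, col in zip(axes.flatten(), cat_cols):
    sns.countplot(x=col, data=df, ax=ax)   # one bar per category of the feature
    ax.tick_params(axis="x", rotation=45)
axes.flatten()[-1].set_visible(False)      # 7 plots on a 4x2 grid, so hide the unused axis
plt.tight_layout()
plt.show()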
Now let's look at the distribution of the numerical variables using common descriptive metrics: count, mean, standard deviation, min, max, and the quartiles.

               age      balance          day     duration     campaign        pdays     previous
count  11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000
mean      41.231948  1528.538524    15.658036   371.993818     2.508421    51.330407     0.832557
std       11.913369  3225.413326     8.420740   347.128386     2.722077   108.758282     2.292007
min       18.000000 -6847.000000     1.000000     2.000000     1.000000    -1.000000     0.000000
25%       32.000000   122.000000     8.000000   138.000000     1.000000    -1.000000     0.000000
50%       39.000000   550.000000    15.000000   255.000000     2.000000    -1.000000     0.000000
75%       49.000000  1708.000000    22.000000   496.000000     3.000000    20.750000     1.000000
max       95.000000 81204.000000    31.000000  3881.000000    63.000000   854.000000    58.000000
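This summary has the shape of pandas' describe() output; a minimal sketch of how it can be reproduced, reusing the assumed DataFrame df:

num_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
print(df[num_cols].describe())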

Our only remaining variable is the target. The count plot of “deposit” shows a nearly equal number of “yes” and “no” values, i.e. the classes are balanced, which is favourable for machine learning algorithms.

(2) The methods/algorithms you used for data wrangling and processing

a. Removing invalid rows: After looking at the categorical variables we found that several of them contain an “unknown” category, most likely the result of missing data points that left only partial information. We can handle this in two ways: fill the value with some predefined or calculated value, or remove the rows containing it. We chose the latter because filling it in with an arbitrary value might result in an inaccurate model, and removal is far simpler.

After removing the invalid rows, 2693 rows remain.
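A minimal sketch of this filtering step, assuming the DataFrame df and the categorical column names listed earlier (the exact code used in the project may differ):

cat_cols = ["job", "marital", "education", "default", "housing", "loan", "poutcome"]

# Keep only the rows in which none of the categorical columns equal "unknown".
mask = ~(df[cat_cols] == "unknown").any(axis=1)
df_clean = df[mask].reset_index(drop=True)
print(len(df_clean))   # number of remaining rows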

b. One-hot encoding: Categorical variables can be stored either as numbers or as strings, and both cause problems: ML models can only work on numbers, strings are not supported directly, and if we treat category codes as numerical values the model applies a spurious ordering and weight to the categories.
So, for each category of a variable, one-hot encoding generates a new column that contains only 0 or 1. For example, “loan” has two values, [“yes”, ”no”]. OHE converts the loan column into two columns, “loan_yes” and “loan_no”. When loan = ”yes”, loan_yes = 1 and loan_no = 0, and vice versa.
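A toy sketch of one-hot encoding the “loan” example with pandas (get_dummies is one common way to do this; the project may have used a different utility):

import pandas as pd

loans = pd.DataFrame({"loan": ["yes", "no", "no", "yes"]})
encoded = pd.get_dummies(loans, columns=["loan"], dtype=int)
print(encoded)
#    loan_no  loan_yes
# 0        0         1
# 1        1         0
# 2        1         0
# 3        0         1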

c. Min-max normalization: Now all of our features are in numerical form, and each feature lies in its own range. Normalization means applying a mathematical function to bring all features into one common range; normalized data helps models learn faster.

One such technique is min-max normalization, which rescales each feature to the range 0 to 1.

Formula: x' = (x - min(x)) / (max(x) - min(x))
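A minimal sketch of applying this formula, both by hand and with scikit-learn's MinMaxScaler; df_clean and the column names are assumptions carried over from the earlier steps:

from sklearn.preprocessing import MinMaxScaler

# Manual version of the formula for a single column.
b = df_clean["balance"]
df_clean["balance"] = (b - b.min()) / (b.max() - b.min())

# Equivalent for several numeric columns at once.
num_cols = ["age", "campaign", "pdays", "previous"]
df_clean[num_cols] = MinMaxScaler().fit_transform(df_clean[num_cols])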

(3) The performance of both unsupervised and supervised learning on the data

The supervised models performed better than the unsupervised ones. For every model except PCA we compared performance using accuracy; for PCA we instead generated a scatter plot of the 2nd principal component against the 1st:

Model Type     Algorithm             Accuracy
Unsupervised   K-Means               47.97%
Unsupervised   PCA                   n/a (evaluated via scatter plot)
Supervised     Logistic Regression   75.25%
Supervised     Decision Tree         74.75%
Supervised     Naïve Bayes           74.37%
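A minimal sketch of how such a comparison can be run, assuming X (the encoded and normalized feature matrix) and y (the target, encoded as 1 = “yes”, 0 = “no”) were prepared in the wrangling steps; the exact split and hyperparameters used in the project are not stated, so the ones below are assumptions:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Supervised models: fit on the training split, score accuracy on the test split.
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))

# K-Means never sees the labels; its two clusters are mapped to whichever labeling scores better.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
pred = km.predict(X_test)
print("K-Means", max(accuracy_score(y_test, pred), accuracy_score(y_test, 1 - pred)))
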
(4) The important features which affect the objective (‘yes’ in ‘deposit’) [Hint: you can refer to the coefficients generated from the Logistic Regression]

Logistic Regression returned these coefficients after training:

feature coefficient
job [0.1459, 0.0491, -0.1256, 0.3392, -0.3676, -0.0581, -0.4963, -0.7585, 0.0213, 0.4456]
marital [-0.0982, 0.1456]
education [0.154, -0.2868]
default [1.2772]
housing [0.8121]
loan [-0.7073]
poutcome [0.3084, -1.7786]
age [0.5886]
balance [4.2845]
campaign [1.9627]
pdays [0.5241]
previous [1.3744]

One basic way to look at the importance of a variable in any linear model is to look at the coefficient the model generated for it.
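A minimal sketch of reading these coefficients out of a fitted scikit-learn model; logreg (the fitted LogisticRegression) and the feature DataFrame X are assumed names from the training step:

import pandas as pd

# Pair each encoded column with its learned weight, largest magnitude first.
coef_table = pd.Series(logreg.coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
print(coef_table)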

“balance” seems to be the most important numerical variable as it has the largest coefficient, followed by “campaign”, “previous”, “age”, and “pdays”.

Categorical feature importance is a little trickier, as the coefficients give the importance of each class rather than of the feature as a whole. The simplest way to judge the importance of a categorical feature in a Logistic Regression model is to look at the class of the feature with the highest coefficient magnitude and compare it with the coefficients of the other features. By that measure, “poutcome”, “default”, and “housing” seem to be the most important categorical features, with coefficients near or above 1 in magnitude.

Other ways to measure the importance of a feature include correlation (for numerical features) and multi-way contingency tables (for categorical features); a brief sketch of both is given below.
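A short sketch of both alternatives, reusing the assumed df_clean from the wrangling steps:

import pandas as pd

# 1) Correlation between a numerical feature and the target encoded as 0/1.
target01 = (df_clean["deposit"] == "yes").astype(int)
print(df_clean["balance"].corr(target01))

# 2) A multi-way (contingency) table of a categorical feature against the target.
print(pd.crosstab(df_clean["poutcome"], df_clean["deposit"], normalize="index"))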

(5) Discuss the possible reasons for obtaining these analysis results and how to improve them

Overall the supervised models seem to perform better than the unsupervised one (only K-Means was scored). Why? I am not sure, but the reason that stands out most is a class imbalance that was not present before but appeared after removing the invalid rows; K-Means may not have been able to cope with that imbalance.

On improving:
- We could convert “age” into a categorical variable; binning it might have a different effect on the resulting accuracy.
- We could include more features such as day, month, duration, et cetera, which might explain some of the unexplained variance in the target feature.
- Using ensemble models such as Random Forest might improve the accuracy (a minimal sketch follows this list).
- Instead of removing the invalid rows, we could impute some intuitive value for them.
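As an illustration of the Random Forest suggestion, a minimal sketch that reuses the assumed X_train/X_test split from the earlier comparison (the hyperparameters below are placeholders, not tuned values):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest", accuracy_score(y_test, rf.predict(X_test)))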

(6) Describe the group activities, such as the task distribution for group members and what you
have learnt during this project.
