Of the selected features, "age", "balance", "campaign", and "pdays" are numerical.
"deposit" is our target variable; it tells us whether the client subscribed to a term deposit.
Let’s explore the categorical variables first. The image shows a count plot for each categorical variable, which tells us how
the members of each feature are distributed:
Now let’s take a look at the distribution of the numerical variables using common descriptive metrics such as mean, min, max,
and standard deviation, along with a few other well-known metrics:
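Our only remaining variable is the target variable. The count plot of “deposit” shows a nearly equal number of “yes” and “no”
values, so the classes are balanced. Balanced classes are helpful for ML algorithms, since a model cannot score well simply by
always predicting the majority class.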
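The balance check read off the count plot can also be done numerically. The following is a minimal sketch in plain Python; the label list is a made-up stand-in for the "deposit" column, not the project's data.

```python
from collections import Counter

# Toy labels standing in for the "deposit" column (assumed values, not the real data).
deposit = ["yes", "no", "no", "yes", "yes", "no", "yes", "no"]

counts = Counter(deposit)
total = sum(counts.values())

# Share of each class; shares near 0.5 indicate balanced binary classes.
shares = {label: count / total for label, count in counts.items()}
print(shares)
```

Running the same check again after any row filtering shows whether the class shares have drifted.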
(2) The methods/algorithms you used for data wrangling and processing
a. Removing invalid rows: After looking at the categorical variables, we found that several of them
have an “unknown” member. It probably marks a missing data point, the result of collecting only partial
information. We can handle it in two ways: fill it with some predefined or calculated value, or remove the
rows containing this member. We chose the latter because filling in a value might result in an
inaccurate model, and removing rows is far easier.
b. One-hot encoding: Categorical variables can be stored either as numbers or as strings. Both have problems. ML models
can only work on numbers, so strings are not supported. If we treat numeric category codes as numerical variables, the
model will read an order and a magnitude into the categories that do not exist.
So for each member of the variable, one-hot encoding (OHE) generates a new column. Each column holds either 0
or 1. For example, “loan” has 2 values: [“yes”, “no”]. OHE converts the “loan” column into 2
columns, namely “loan_yes” and “loan_no”. When loan = “yes”, loan_yes = 1 and loan_no = 0, and vice versa.
c. Min-max normalization: Now we have all our features in numerical form, but each feature lies in its own
range. Normalization means applying a mathematical function to bring all the features into one particular
range. Normalizing data helps many models learn faster.
One such technique is min-max normalization, which brings each feature into the range from 0 to 1.
Formula: x' = (x - min(x)) / (max(x) - min(x))
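Steps a to c above can be sketched end to end in plain Python. This is an illustrative sketch, not the code used in the project: the rows, the column lists, and the helper names (drop_unknown, one_hot, min_max) are all assumptions made for the example.

```python
# Toy rows standing in for the bank dataset (values are made up for illustration).
rows = [
    {"age": 30, "balance": 1000, "loan": "yes", "job": "admin"},
    {"age": 50, "balance": 3000, "loan": "no", "job": "unknown"},
    {"age": 40, "balance": 2000, "loan": "no", "job": "technician"},
    {"age": 20, "balance": 0, "loan": "yes", "job": "admin"},
]
CATEGORICAL = ["loan", "job"]
NUMERICAL = ["age", "balance"]


def drop_unknown(rows, columns):
    """Step a: keep only rows with no "unknown" value in the given columns."""
    return [r for r in rows if all(r[c] != "unknown" for c in columns)]


def one_hot(rows, columns):
    """Step b: replace each categorical column with one 0/1 column per member."""
    members = {c: sorted({r[c] for r in rows}) for c in columns}
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k not in columns}
        for c in columns:
            for m in members[c]:
                new_row[f"{c}_{m}"] = 1 if r[c] == m else 0
        encoded.append(new_row)
    return encoded


def min_max(rows, columns):
    """Step c: rescale each numerical column to the range [0, 1]."""
    scaled = [dict(r) for r in rows]
    for c in columns:
        lo = min(r[c] for r in rows)
        hi = max(r[c] for r in rows)
        span = hi - lo
        for r in scaled:
            # A constant column has span 0; map it to 0.0 to avoid dividing by zero.
            r[c] = (r[c] - lo) / span if span else 0.0
    return scaled


clean = drop_unknown(rows, CATEGORICAL)
encoded = one_hot(clean, CATEGORICAL)
processed = min_max(encoded, NUMERICAL)
for row in processed:
    print(row)
```

The order matters here: rows are dropped first so that no "job_unknown" column is created, and the minimum and maximum are computed only over the rows that remain. With pandas and scikit-learn, `pandas.get_dummies` and `sklearn.preprocessing.MinMaxScaler` cover steps b and c.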
(3) The performance of both unsupervised and supervised learning on the data
The supervised models performed better than unsupervised ones. For every model except PCA we compared
performance using accuracy. In case of PCA we generated a scatter plot of 2nd component vs 1st component:
feature      coefficient(s)
job          [0.1459, 0.0491, -0.1256, 0.3392, -0.3676, -0.0581, -0.4963, -0.7585, 0.0213, 0.4456]
marital      [-0.0982, 0.1456]
education    [0.154, -0.2868]
default      [1.2772]
housing      [0.8121]
loan         [-0.7073]
poutcome     [0.3084, -1.7786]
age          [0.5886]
balance      [4.2845]
campaign     [1.9627]
pdays        [0.5241]
previous     [1.3744]
One basic way to gauge the importance of a variable in a linear model is to look at the magnitude of
its coefficient. Because all features were min-max normalized to the same range, the magnitudes are comparable.
“balance” seems to be the most important numerical variable, as it has the largest
coefficient (4.2845). After that come “campaign”, then “previous”, then “age” and “pdays”.
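The ranking above can be reproduced with a short sketch. The coefficient values are the ones reported for the numerical features; sorting by absolute value is an assumption made here so that a large negative coefficient would rank as high as a large positive one.

```python
# Coefficients of the numerical features, as reported in the coefficient table.
coefficients = {
    "age": 0.5886,
    "balance": 4.2845,
    "campaign": 1.9627,
    "pdays": 0.5241,
    "previous": 1.3744,
}

# Rank by absolute value: the sign gives the direction of the effect, not its strength.
ranking = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranking:
    print(f"{name:10s} {value:+.4f}")
```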
(5) Discuss the possible reasons for obtaining these analysis results and how to improve them
Overall, the supervised models seem to perform better than the unsupervised ones (K-Means only). Why?
I am not sure, but the reason that stands out the most is the class imbalance, which wasn’t present before but
appeared after removing the invalid rows. Maybe K-Means wasn’t able to account for part of that class imbalance.
On improving:
We can convert age into a categorical variable. It might change the resulting accuracy.
We can include more features, such as day, month, and duration, which might be able to explain the
unexplained variance in the target feature.
Using ensemble models like Random Forest might improve the accuracy.
Instead of removing the invalid rows, we could impute an intuitive value for them.
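The last suggestion, imputing instead of dropping rows, could look like the following sketch. It fills “unknown” with the most frequent known value of the column (mode imputation), which is only one possible choice of "intuitive value". The helper name impute_mode and the sample column are hypothetical.

```python
from collections import Counter


def impute_mode(values, missing="unknown"):
    """Replace every `missing` entry with the most frequent known value."""
    known = [v for v in values if v != missing]
    if not known:
        # Nothing to learn the mode from; return the column unchanged.
        return list(values)
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]


# Toy "education" column (made-up values for illustration).
education = ["secondary", "unknown", "tertiary", "secondary", "unknown"]
imputed = impute_mode(education)
print(imputed)
```

Unlike row removal, this keeps every row, so it would not shift the class balance of the target, at the cost of introducing guessed values.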
(6) Describe the group activities, such as the task distribution for group members and what you
have learnt during this project.