
BANK LOAN RECOMMENDATION SYSTEM

ABSTRACT

A loan is the lending of money by one or more individuals, organizations, or banks to other individuals or organizations. The recipient (the borrower) incurs a debt and is usually liable to pay interest on that debt until it is repaid, as well as to repay the principal amount borrowed.

Nowadays banks give loans to their customers for different purposes. Banks provide two types of loans: personal loans and business loans. The problem is to determine which type of loan should be recommended to an individual and which suits them best. To solve this problem we consider the customers' credit data and, based on these details, recommend loans to the customers. If a customer wants to buy personal goods, the model checks the customer's data and predicts whether he is eligible for a loan or not. The same applies to business loans: if the customer wants to invest in education, the model checks his details and recommends a business loan accordingly.

For this purpose, we use a supervised machine learning technique called classification to predict whether a particular person is eligible for a bank loan, based on the person's previous transactional information. If so, the model then predicts which type of loan he is eligible for: personal or business.

TABLE OF CONTENTS

1. Introduction
2. Statement of Problem
3. Literature Survey
4. Software Requirements Specification
5. References

1. INTRODUCTION

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves. Machine learning is categorised into two types, supervised learning and unsupervised learning; the classification model comes under supervised learning.

Classification is a supervised learning approach in which the computer program learns from the data input given to it and then uses this learning to classify new observations. There are different types of classification models; some of them are:

1.1 DECISION TREE CLASSIFIER:

A decision tree classifier is easy to understand and interpret: the tree is displayed in a graphical format, so even non-experts can easily follow it, and the explanation for its results is understandable (it is a white-box model). The data is represented in the form of a tree, where observations about the target item are captured along the branches and the conclusion about the item is given at the leaves; the constructed trees thus explicitly represent the decisions being made. The model can deal well with large datasets and performs well when the constructed tree is small, but it can also produce large, complex trees that do not generalise the training data well.
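As a minimal sketch of how such a classifier might be built with scikit-learn (using a synthetic dataset in place of the real credit data, which is not reproduced here):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for a customer-credit dataset: X = features, y = labels.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # max_depth limits tree size; as noted above, small trees generalise better.
    clf = DecisionTreeClassifier(max_depth=4, random_state=42)
    clf.fit(X, y)
    print(clf.predict(X[:5]))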

1.2 LOGISTIC REGRESSION:

The logistic regression model is easy to implement and efficient to train. It performs well when all data unrelated to the output has been removed, so it relies heavily on the proper presentation of the data. Logistic regression has only two output values: it learns the relationship between the dependent variable and the features, computes probabilities using the logistic function, and transforms these probabilities into the binary values 0 and 1. Based on these values we predict the result as false or true (0 or 1).

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis: it is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
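As a minimal sketch (synthetic data, not the project's actual dataset), the probability-then-threshold behaviour described above looks like this in scikit-learn:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)

    # predict_proba gives the logistic-function probabilities;
    # predict thresholds them at 0.5 to produce the binary 0/1 output.
    print(clf.predict_proba(X[:3]))
    print(clf.predict(X[:3]))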

1.3 ADABOOST CLASSIFIER:

AdaBoost is used to boost the performance of a classifier. In the AdaBoost model, weak learners are trained on the data iteratively. After each training iteration, the observations that the current weak learner failed to classify correctly are given higher weights, so the next iteration focuses more on the complex, misclassified observations. The iterations are repeated a user-defined number of times to find a set of weighted hypotheses that, combined, perform best on unseen data. Each weak classifier is trained on the weighted samples; typically it makes one decision on one input variable and outputs a positive or negative label. These weak learners are then combined to form a strong model, and predictions are made by taking the weighted average of the weak classifiers' outputs.
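A minimal sketch with scikit-learn's AdaBoostClassifier (synthetic data; the number of boosting iterations is the n_estimators parameter):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # By default each weak learner is a one-level decision tree (a decision
    # stump); n_estimators is the number of boosting iterations described above.
    clf = AdaBoostClassifier(n_estimators=50, random_state=42)
    clf.fit(X, y)
    print(clf.score(X, y))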

1.4 SVM CLASSIFIER:

SVMs are supervised learning models with associated learning algorithms that analyse the data used for classification. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on. The SVM uses a kernel function to help separate the classes; the kernel function transforms a nonlinearly separable problem into a linearly separable one.
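A minimal sketch with scikit-learn's SVC (synthetic data; the RBF kernel is one common choice of kernel function):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # The RBF kernel implicitly maps the inputs into a higher-dimensional
    # space where a maximum-margin separating hyperplane can be found.
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, y)
    print(clf.predict(X[:5]))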

1.5 XGBOOST CLASSIFIER:

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. In the gradient boosting approach, new models are created that predict the errors of the prior models and are then added together to make the final prediction; the approach attempts to predict the target by combining the estimates of a set of simpler, weaker models. The XGBoost model is used to reduce the loss and increase performance, and it supports generic loss functions.
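A minimal sketch using the scikit-learn-compatible XGBClassifier from the separate xgboost package (synthetic data; the hyperparameter values are illustrative):

    # Requires the xgboost package (pip install xgboost).
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Each new tree is fitted to the errors of the current ensemble;
    # learning_rate shrinks each tree's contribution to limit overfitting.
    clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
    clf.fit(X, y)
    print(clf.predict(X[:5]))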

2. STATEMENT OF PROBLEM

Banks give loans to their customers without being sure whether the borrowers will be able to repay, yet they have to provide loans to extend their markets. To identify who is capable of paying back a loan and to recommend the right loan to them, we are developing a bank loan recommendation system that helps banks find the right customer to lend to and recommends the customer's preferred loan type, business or personal. Categories such as investing in a business, education expenses, or repairing appliances come under business loans; buying cars and other personal goods comes under personal loans.

3. LITERATURE SURVEY

3.1 Lukas Frei's Credit Card Fraud Detection system

Online money transfers and credit/debit card transactions have become a very common way of paying. At the same time, credit/debit card fraud has been growing day by day. To detect it, Lukas Frei proposed a machine learning classification model that decides whether a transaction record is fraudulent or not based on previous transactions. To generate a machine learning model for this task we need a dataset of predictor variables and a response variable. For this purpose, a credit card transaction dataset was collected from Kaggle; it has a total of 31 features, 28 of which are anonymized and labeled z1 through z28. The only other features available in the data are the predictor variables, the time of the transaction and the transaction amount, and the response variable, which indicates whether the transaction is fraudulent or not.

The classification model here is generated in the following steps:

1. Data pre-processing
2. Dataset-splitting
3. Model generation
4. Evaluating the model

Data pre-processing

The dataset had a few empty cells, which is undesirable when using it to train a model: the accuracy of the model would suffer if the input dataset had empty entries. To avoid this, the mean of each feature is calculated and the empty values of the feature are replaced with the mean of its non-empty values.
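A minimal sketch of this mean-imputation step with pandas (the tiny table and its column names are illustrative, not the actual Kaggle data):

    import numpy as np
    import pandas as pd

    # Tiny stand-in for the transaction table.
    df = pd.DataFrame({"Time": [0.0, 3.0, np.nan, 9.0],
                       "Amount": [149.62, np.nan, 2.69, 378.66]})

    # Replace every empty cell with the mean of the non-empty values
    # in its column.
    df = df.fillna(df.mean())
    print(df)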

Secondly, the amount column has very large values while the time column has very small values. This becomes a problem when training a model that uses distance metrics: due to the huge difference in scale, the amount column would outweigh the time column when distances are calculated. To avoid this, the columns are rescaled using a technique called z-score normalization, which gives each column zero mean and unit standard deviation.
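A sketch of this scaling step with scikit-learn's StandardScaler, which implements z-score normalization (column names again illustrative):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"Time": [0.0, 3.0, 6.0, 9.0],
                       "Amount": [149.62, 15.00, 2.69, 378.66]})

    # z-score normalization: (x - mean) / std, giving each column zero mean
    # and unit variance so Amount no longer dominates distance calculations.
    df[["Time", "Amount"]] = StandardScaler().fit_transform(df[["Time", "Amount"]])
    print(df)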

Data Splitting

This has been the most challenging part, because the dataset contains a very low number of fraudulent tuples: most of the examples are non-fraudulent, i.e., the dataset is highly imbalanced. This imbalance is a huge problem when training the model, so to avoid it we take all the fraudulent tuples and an equal number of non-fraudulent tuples, so that every time we train the model the data is balanced (an equal number of positive-class and negative-class members).
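A minimal sketch of this undersampling step on a made-up imbalanced table (the column names, including a 'Class' flag that marks fraud with 1, are assumptions):

    import numpy as np
    import pandas as pd

    # Illustrative imbalanced table: 'Class' is 1 for fraud, 0 otherwise.
    rng = np.random.default_rng(42)
    df = pd.DataFrame({"Amount": rng.normal(80, 30, 1000),
                       "Class": [1] * 20 + [0] * 980})

    # Keep all fraud rows and draw an equal number of non-fraud rows at random.
    fraud = df[df["Class"] == 1]
    non_fraud = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
    balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
    print(balanced["Class"].value_counts())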

Model Generation

In this part, the dataset is divided into two sets, a training set and a testing set. To avoid overfitting, we use the k-fold cross-validation technique to split the dataset: it splits the dataset into k subsets and trains the model on k-1 training subsets before testing on the kth subset, repeating the process until each of the k subsets has acted as the testing set once (see the sketch after the following list). To find out which algorithm works best, the training dataset was given to a number of classification algorithms:

• Logistic Regression
• K-Nearest Neighbours
• Decision Tree Classifier
• Support Vector Classifier
• Random Forest Classifier
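As a minimal sketch of the k-fold step (shown here with logistic regression on synthetic data; any of the classifiers listed above could be substituted):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # 5-fold cross-validation: train on four folds, test on the fifth,
    # rotating until every fold has served as the test set once.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores, scores.mean())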

Model Evaluation

After training the model on the training set and testing it on the testing set, the results predicted by the model are cross-checked against the actual class labels by means of evaluation measures like accuracy, precision, and recall; the models with better values of these measures are taken into consideration and are trained further to increase the prediction accuracy.
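As a rough illustration with made-up labels, these measures can be computed with scikit-learn as follows:

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual class labels (illustrative)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by the model

    print("Accuracy: ", accuracy_score(y_true, y_pred))   # 6 of 8 correct
    print("Precision:", precision_score(y_true, y_pred))  # 3 of 4 predicted positives
    print("Recall:   ", recall_score(y_true, y_pred))     # 3 of 4 actual positives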

3.2 Finding donors project using machine learning classification:

Many charity organizations have the problem of finding donors who are capable of donating money to them. Since such organizations mostly run on donations, if they can find people with a good income who are capable of donating, they can approach them for donations. For this, machine learning classification algorithms are used to predict whether a person is likely to donate or not. The U.S. census dataset, donated by Ron Kohavi and Barry Becker, was taken from Kaggle. The steps involved in the generation of the model are:

Preprocessing of data:

Firstly, the dataset has many empty values, which would affect the generation of the model and hurt its accuracy score. Here the null values are replaced, and other preprocessing techniques like normalization are also applied to the data to eliminate disturbances in it. There are several features which are non-numeric. Typically, learning algorithms expect input to be numeric, which requires that non-numeric features (called categorical variables) be converted. One popular way to convert categorical variables is the one-hot encoding scheme. The target feature is encoded as 1 if the person is a donor and 0 if he is not.
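A minimal sketch of one-hot encoding with pandas (the two columns are a made-up stand-in for the census data):

    import pandas as pd

    # Illustrative census-style columns.
    df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"],
                       "income": [">50K", "<=50K", ">50K"]})

    # One-hot encode the categorical feature and binarise the target.
    features = pd.get_dummies(df[["workclass"]])
    target = (df["income"] == ">50K").astype(int)  # 1 = potential donor
    print(features)
    print(target)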

Splitting of data:

To generate the model, the data has to be split into a training dataset and a testing dataset. Here 80% of the data was used for training and 20% for testing.
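A minimal sketch of this 80/20 split with scikit-learn (synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Hold out 20% of the rows for testing; train on the remaining 80%.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42)
    print(len(X_train), len(X_test))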

Generating the model:

After splitting the data, the models are generated. Several supervised learning models taken from scikit-learn were applied. The different classification models used in this project are:

• Gaussian Naive Bayes (GaussianNB)
• Decision Trees
• Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)
• K-Nearest Neighbors (KNeighbors)
• Stochastic Gradient Descent Classifier (SGDC)
• Support Vector Machines (SVM)
• Logistic Regression

Model evaluation:

In order to measure the predictions of the applied models, different evaluation metrics are used. The evaluation metrics used are:

Accuracy measures how often the classifier makes the correct prediction. It is the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision tells us what proportion of the individuals we classified as donors actually were donors. It is the ratio of true positives to all predicted positives (everyone classified as a donor, irrespective of whether that classification was correct); in other words:

Precision = True Positives / (True Positives + False Positives)

3.3 Diabetes prediction using medical data by Dr. D. Asir Antony Gnana Singh:

Nowadays, diabetes has become a common disease among mankind, from the young to the old. The number of diabetic patients is increasing day by day due to various causes such as bacterial or viral infection, toxic or chemical contents mixed with food, autoimmune reaction, obesity, bad diet, changes in lifestyle, eating habits, environmental pollution, etc. Hence, diagnosing diabetes is essential to save human lives. Data analytics is the process of examining and identifying hidden patterns in large amounts of data to draw conclusions. In health care, this analytical process is carried out using machine learning algorithms to analyse medical data and build machine learning models for medical diagnosis. This paper presents a diabetes prediction system to diagnose diabetes. Moreover, it explores approaches to improving the accuracy of diabetes prediction using medical data with various machine learning algorithms and methods.

This section reviews various research works related to the proposed work. Mohammed Abdul Khaleel et al. conducted a survey of data mining techniques on medical data for finding locally frequent diseases. The main focus of the survey is to analyse the data mining techniques required for medical data analysis, especially those used to discover locally frequent diseases such as heart ailments, lung cancer, and breast cancer, using the classification and regression tree algorithm and decision tree algorithms such as ID3 and C4.5.

1. Chunhui Zhao et al. presented a system for subcutaneous glucose concentration prediction. The proposed model can make predictions for type 1 diabetes mellitus.
2. Vaishali Agarwal et al. presented a performance analysis of competitive learning algorithms on Gaussian data for automatic cluster selection; they studied and analysed the performance of these algorithms, and the randomized results were analysed on 2-D Gaussian data with the learning-rate parameter kept the same for all algorithms. The algorithms used in their work include the clustering algorithm, the competitive learning algorithm, and the frequency-sensitive competitive learning algorithm. Supervised machine learning algorithms are used for classification of the Gaussian data.
3. K. Srinivas et al. developed applications of data mining techniques in healthcare and the prediction of heart attacks. This research used medical profiles such as age, sex, blood pressure, and blood sugar, and predicted the likelihood of patients getting heart and kidney problems.
4. M. Durairaj and V. Ranjani discussed the potential use of classification-based data mining techniques such as rule-based methods, the decision tree algorithm, Naïve Bayes, and artificial neural networks (ANN) on the massive volume of healthcare data. In this research, medical problems such as heart disease and blood pressure were analysed and evaluated.
5. Salim Diwani et al. discussed the applications of data mining in health care. The paper also presented an overview of research on healthcare applications using data mining techniques. Data mining is a technology used for knowledge discovery in databases (KDD) and data visualization. Moreover, medical data in the form of text and digital medical images such as X-rays and magnetic resonance imaging (MRI) are used in disease diagnostic processing.
6. Darcy A. Davis proposed individual disease risk prediction based on medical history. The paper predicts each patient's greatest disease risks based on their own medical history data; the datasets used are based on medical coding, and a collaborative filtering approach is applied.

From this literature, it is observed that machine learning algorithms play a significant role in knowledge discovery from databases, especially in medical diagnosis with medical data.

4. SOFTWARE REQUIREMENTS SPECIFICATION
4.1 Introduction

Software requirement analysis is the first step in the process of software development. As the complexity of a system increases, it becomes evident that the goal of the entire system cannot be comprehended easily; hence the need for a requirements phase arose. Software requirement analysis is the means of translating the ideas in the minds of the clients into a formal document; it is the medium through which the client and user needs are accurately specified. It forms the basis of software development. A good software requirement analysis should satisfy all the parties involved in the system.

4.2 Purpose

The purpose of the project is to predict if a bank customer is eligible for a personal loan or a
business loan, using a supervised Machine Learning technique called classification.

4.3 Scope

This project makes use of the customer's previous credit/debit card transactional data for the analysis and predicts whether the customer is eligible for a personal loan or a business loan.

4.4 Objectives

• To predict whether a customer is eligible for a personal loan or a business loan.
• To improve the accuracy of the developed model as far as possible.

4.5 Functional Requirements

The functional requirements define the functionality of the developed software or its components. They must specify how the system must respond or behave when specific inputs or conditions are given to it. This includes steps like pre-processing of the data before model generation, training the model, testing the model, and other functionality that defines what the system must accomplish.

• Firstly, we pre-process the data by filling the empty cells with the mean of the non-empty values of the same column.
• We split the pre-processed dataset into training and testing sets.
• Then we train the model using the training dataset.
• Next, we test the generated model using the testing dataset and compute measures like accuracy, precision, and recall.
• Finally, we predict the output using the model (an end-to-end sketch of these steps follows).
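A minimal end-to-end sketch of these functional steps, on synthetic data with assumed column names (amount, balance, eligible):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    # Illustrative customer-transaction table; the columns are assumptions.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"amount": rng.normal(100, 40, 300),
                       "balance": rng.normal(5000, 1500, 300),
                       "eligible": rng.integers(0, 2, 300)})
    df.loc[::17, "amount"] = np.nan                      # simulate empty cells

    df = df.fillna(df.mean())                            # 1. fill with column means
    X, y = df[["amount", "balance"]], df["eligible"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)             # 2. split 80/20

    model = LogisticRegression().fit(X_train, y_train)   # 3. train the model
    y_pred = model.predict(X_test)                       # 4. test the model
    print(accuracy_score(y_test, y_pred),                # 5. evaluate
          precision_score(y_test, y_pred, zero_division=0),
          recall_score(y_test, y_pred, zero_division=0))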

4.6 Non-Functional Requirements

In requirements engineering, a non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviours. They are contrasted with functional requirements, which define specific behaviour or functions. The plan for implementing functional requirements is detailed in the system design; the plan for implementing non-functional requirements is detailed in the system architecture, because they are usually architecturally significant requirements.

4.6.1 Performance

Performance is measured in terms of the output given by the model. Various performance measures such as accuracy, precision, and recall are used to analyse the performance of the model.

4.6.2 Cost effectiveness

The total cost involved in developing this software is zero, since we use only open-source development environments.

4.6.3 Software Requirements

Operating System: Windows 7/8.1/10

Programming Language: Python

Python Modules used: NumPy, Pandas, Matplotlib, scikit-learn

4.6.4 Hardware Requirements

Processor: Intel dual core i3/i5/i7

RAM: 8 GB

Others: Standard Monitor, Keyboard, Mouse

5. REFERENCES

[1] Lukas Frei, Data Science Consultant at PwC, "Detecting Credit Card Fraud Using Machine Learning", January 16, 2019.

[2] Sanjal Sharma, Data Scientist at Unscrambl Inc., Master of IT, University of Melbourne, "Finding Donors for Charity using Machine Learning", May 17, 2017.

[3] Dr. D. Asir Antony Gnana Singh, "Diabetes Prediction Using Medical Data", January 13, 2017.

