Diabetes

A MAJOR-PROJECT REPORT
ON
“A COMPARATIVE STUDY ON PREDICTION
AND DIAGNOSIS OF DIABETES”
Submitted to
KIIT, Deemed to be University
In Partial Fulfillment of the Requirement for the Award of
BACHELOR’S DEGREE IN
COMPUTER ENGINEERING
BY
SOUVIK PODDER 1405171

SRIJANI CHAKROBORTY 1405173
SUDEEPTA BAL 1405175
SUMIT 1405176
UNDER THE GUIDANCE OF

PROF. HARISH KUMAR PATTNAIK
SCHOOL OF COMPUTER ENGINEERING

KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
BHUBANESWAR, ODISHA - 751024
2017-2018
Submitted to
School of Computer Engineering
Bhubaneswar, ODISHA 751024
CERTIFICATE
This is to certify that the project entitled
“A COMPARATIVE STUDY ON PREDICTION AND
DIAGNOSIS OF DIABETES”
submitted by

SRIJANI CHAKRABORTY 1405173
SUMIT 1405176
is a record of bonafide work carried out by them, in partial fulfilment of the
requirement for the award of the Degree of Bachelor of Engineering in Computer
Science at Kalinga Institute of Industrial Technology (KIIT), Deemed to be University,
Bhubaneswar. This work was done during the year 2017-2018, under our guidance.
Date: 06 / 04 / 2018
(Prof.HARISH KUMAR PATTNAIK) (Prof. PINAKI CHATTERJEE)

Project Guide Project Coordinator
Acknowledgements
Apart from our own efforts, the success of this project depends largely on the
encouragement and guidance of many others. We take this opportunity to express our
gratitude to the people who have been instrumental in the successful completion of
this project. We take immense pleasure in thanking and warmly acknowledging the
continuous encouragement, invaluable supervision, timely suggestions and inspired
guidance offered by our project mentor Prof. Harish Kumar Pattnaik, School
of Computer Science and Engineering, Kalinga Institute of Industrial Technology
(KIIT), Deemed to be University, in bringing this report to a successful completion.
We are grateful to Dr S. Mishra, Dean of the School of Computer Science and
Engineering, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University,
for permitting us to make use of the facilities available in the department to carry out
the project successfully. We also express our sincere thanks to all our friends who
have patiently extended all sorts of help in accomplishing this undertaking. Finally,
we extend our gratitude to one and all who were directly or indirectly involved in
the successful completion of this project work.

SRIJANI CHAKRABORTY 1405173
SUMIT 1405176
ABSTRACT
Diabetes is one of the most pervasive diseases in the world. It can lead to heart attack,
blindness, kidney disease, etc. To be diagnosed, a patient needs to visit a diagnostic
center, consult a doctor and wait for the reports, and has to spend money every time a
diagnosis report is needed. Early prediction of the disease allows patients to be
treated before their condition becomes critical.
The phenomenal advancement in biotechnology and health sciences has led to a
high throughput of data, and research into all aspects of diabetes has generated
extensive data. The aim of the present research is to review the application of
machine learning techniques and tools in the field of diabetes with respect to prediction
and diagnosis, genetic background and environment, and health care management.
A wide range of machine learning algorithms has been employed. This project
aims to develop a system that can predict a patient’s chances of being diabetic and help
in early diagnosis. The machine learning algorithms used are Support Vector
Machines, Neural Networks, KNN and Decision Tree. A comparative study based on
the accuracy of the different algorithms has been carried out.
Keywords:
Diabetes, Machine Learning, Diagnosis, SVM, Neural Network, KNN, Decision
Tree.
Contents
1 Introduction
1.1 MACHINE LEARNING
1.1.1 BRIEF
1.1.2 APPLICATION OF MACHINE LEARNING
1.1.3 PURPOSE
1.1.4 SCOPE
2 Literature Survey
3 PROBLEM DEFINITION
4 Project Planning
4.1 Gantt chart
5 Requirement Analysis
5.1 TECHNOLOGY REQUIREMENTS
5.2 DATA REQUIREMENTS
6 System Design
6.1 EXPLORING THE DATA
6.2 STRATIFICATION
6.3 SUPERVISED LEARNING MODEL
6.4 K-FOLD CROSS VALIDATION
6.5 CALCULATING THE ACCURACY
6.6 ENSEMBLING
6.7 NEURAL NETWORK MODEL
6.8 OPTIMIZING THE NEURAL NETWORK MODEL
6.9 CREATING REPORT ABOUT THE MODEL
7 System Testing
7.1 TESTING NEURAL NETWORK
8 Implementation
9 Result of Project
9.1 SUPERVISED LEARNING
9.2 REINFORCEMENT LEARNING
10 Conclusion
11 Future Scope
List of Figures
4.1 Gantt chart
6.1 SYSTEM DESIGN
7.1 ACTUAL OUTCOME VS PREDICTED OUTCOME FOR THE FIRST 25 ROWS
8.1 DESCRIPTION OF THE STANDARDIZED DATA
8.2 PLOTTING THE DIABETES DATA-SET
8.3 HISTOGRAM OF ALL THE FEATURES IN THE DATA-SET
8.4 HISTOGRAM OF ALL THE FEATURES OF DIABETES PATIENT USING HUE
8.5 HISTOGRAM
8.6 PAIR PLOT OF ALL THE FEATURES
8.7 HEAT MAP
8.8 LOGISTIC REGRESSION
8.9 SUPPORT VECTOR MACHINE
8.10 K NEAREST NEIGHBOUR
8.11 DECISION TREE
8.12 RANDOM FOREST
8.13 MEAN CROSS VALIDATION SCORE OF ALL SUPERVISED ALGORITHM
8.14 CHANGE IN ACCURACY OF THE MODELS AFTER CROSS VALIDATION
8.15 BAR PLOT OF CHANGE IN ACCURACY
8.16 BOX PLOT OF THE ACCURACY OF SUPERVISED LEARNING
8.17 NEURAL NETWORK MODEL
8.18 BATCH SIZE AND EPOCH
8.19 BEST BATCH SIZE AND EPOCH
8.20 LEARNING RATE AND DROPOUT RATE
8.21 BEST LEARNING RATE AND DROPOUT RATE
8.22 ACTIVATION AND INITIALIZATION
8.23 BEST ACTIVATION AND INITIALIZATION
8.24 NEURONS IN EACH LAYER
8.25 BEST NEURONS IN EACH LAYER
8.26 ACCURACY SCORE, CLASSIFICATION REPORT AND CONFUSION MATRIX
9.1 BAR PLOT SHOWING THE ACCURACY OF THE SUPERVISED MODELS
9.2 BOX PLOT OF THE ACCURACY OF SUPERVISED LEARNING
9.3 ACCURACY SCORE AND CLASSIFICATION REPORT FOR NEURAL NETWORK
9.4 CONFUSION MATRIX FOR THE NEURAL NETWORK
11.1 ENSEMBLE LEARNING
11.2 PARALLEL PROCESS THE ABOVE TASK
A COMPARATIVE STUDY ON PREDICTION AND DIAGNOSIS OF DIABETES
Chapter 1
Introduction
1.1 MACHINE LEARNING
1.1.1 BRIEF
Machine learning automates the building of analytical models through data
analysis. Using algorithms that learn from data iteratively, it allows computers
to find hidden insights without being explicitly programmed where to look.
In our project we have used various supervised learning models and a neural
network model on the Pima Indians Diabetes data-set to predict whether a patient has
diabetes or not. We made a comparative analysis of all the algorithms by calculating
their accuracy score, classification report and confusion matrix.
1.1.2 APPLICATION OF MACHINE LEARNING
Machine learning is used in various real-world applications:
a. Fraud detection.
b. Web search results.
c. Credit scoring and next-best offers.
d. Prediction of equipment failures.
e. Recommendation Engines.
f. Customer Segmentation.
g. Text Sentiment Analysis.
School of Computer Engineering, KIIT, BBSR 1

h. Pattern and image recognition.
i. Email spam filtering.
j. Financial Modeling
1.1.3 PURPOSE
In the context of predictive analysis, machine learning algorithms help us to
extract knowledge from large volumes of data on diabetic patients. Because of
its social impact, diabetes mellitus (DM) is one of the main priorities in medical
science research, which in turn produces a large amount of data. Machine learning
and data mining approaches have therefore proved useful in the diagnosis,
management and related clinical administration of DM. Hence, within the framework
of this study, considerable effort has been made to review the current literature
on machine learning in diabetes research.
1.1.4 SCOPE
Across the globe, there is a need for regional studies on diabetes prediction.
An early diagnosis can reduce the number of treatments a patient has to undergo
compared with a late diagnosis. The patient can control his/her diet accordingly
and appropriate measures can be taken beforehand. This project brings together the
analysis reports of several supervised learning algorithms, namely KNN, Logistic
Regression, Random Forest, Decision Tree and SVM, together with an artificial
neural network. In our analysis, Random Forest and the Neural Network performed
best, with accuracies of 79 percent and 82 percent respectively.
Chapter 2
Literature Survey
Diabetes mellitus is a chronic metabolic disorder characterised mainly by elevated
blood sugar. In this disorder the body cannot make proper use of insulin, as a result
of which the blood glucose level increases. Studies have shown that diabetes is among
the most pervasive disorders and that the number of diabetic patients is increasing
by 12 per cent every year. The cells of the body do not respond to the action of
insulin, and the insulin secreted by the pancreas is not enough to overcome this
resistance. Sugar therefore accumulates in the bloodstream while the cells are
deprived of it, which in turn leads to a loss of energy, and the person becomes
diabetic. This project deals with the comparative and predictive analysis
of diabetes using machine learning concepts and algorithms.
Using machine learning concepts, a model has been created from a pre-defined
data-set whose features include insulin level, BMI, skin thickness, diastolic blood
pressure and so on. The data-set is split into training and testing data-sets. The
training inputs and outputs are used to fit the model. The testing inputs are
then used to predict outcomes with the model, and these predictions are compared
with the testing outputs. The following supervised machine learning models have
been used for the comparative study:
a. Regression: a statistical technique that describes the strength of the
relationship between one dependent variable and a number of other changing
(independent) variables.
b. Support Vector Machine: a supervised machine learning algorithm that can be
used for both classification and regression problems.
c. KNN: K nearest neighbours is an algorithm that stores all available cases and
classifies new cases based on a similarity measure. KNN has been used in
statistical estimation and pattern recognition.
d. Decision Tree and Random Forest: Decision trees are a type of model used for
both classification and regression. Trees answer sequential questions which send
us down a certain route of the tree given the answer. The model behaves with “if
this, then that” conditions, ultimately yielding a specific result. A random forest
combines the predictions of many such trees.
The concepts of neural networks have also been used in this project. Keras is a
high-level neural network API, written in Python, capable of running
on top of TensorFlow, CNTK or Theano. Here, we ran Keras on Theano. Keras was
developed with a focus on enabling fast experimentation: being able to go from
idea to result with the least possible delay is key to doing good research.
Using these models, a comparative study has been done so that the model with the
highest accuracy can be used for the prediction of diabetes. Each model has been
studied in depth and their efficiencies compared.

Chapter 3
PROBLEM DEFINITION
Conventionally, doctors examine the reports of different medical tests and
then conclude whether or not the person has diabetes. In doing so a doctor may
overlook some of the parameters and reach a wrong conclusion. The proposed system
will help a doctor to predict more accurately, based on its computational analysis
of previous data-sets. Our goal is to build a system which can predict and diagnose
diabetes from previous data-sets and the test results of patients, using different
machine learning techniques.

Chapter 4
Project Planning
4.1 Gantt chart
Figure 4.1: Gantt chart
A Gantt chart is a type of bar chart that illustrates a project schedule.
The chart lists the tasks to be performed on the vertical axis and time intervals
on the horizontal axis. The width of each horizontal bar shows the duration of the
corresponding activity. Gantt charts show the start and finish dates of the
terminal elements and summary elements of a project.

Chapter 5
Requirement Analysis
5.1 TECHNOLOGY REQUIREMENTS:
a. Anaconda Distribution
b. Python 3.5.
c. Jupyter Notebook
d. Scikit learn
e. Pandas
f. Matplotlib
g. Seaborn
h. Keras
i. Numpy.
5.2 DATA REQUIREMENTS:
The data requirements of the project are fulfilled by the Pima Indian Diabetes dataset.
It consists of 768 entries, each with eight features and one output column
taking the values 0 and 1.
The features include:
a. Number of times pregnant
b. Plasma glucose concentration in an oral glucose tolerance test
c. Diastolic blood pressure (mm Hg)
d. Triceps skin fold thickness (mm)
e. 2-hour serum insulin (μU/ml)
f. Body mass index (kg/m²)
g. Diabetes Pedigree Function
h. Age (years)

Chapter 6
System Design
In this project, the Pima Indian diabetes data-set is chosen for EDA and for building
machine learning models that predict whether a person has diabetes, based on the
number of times pregnant, plasma glucose concentration in an oral glucose tolerance
test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum
insulin (μU/ml), body mass index (kg/m²), Diabetes Pedigree Function and age (years).
Figure 6.1: SYSTEM DESIGN
6.1 EXPLORING THE DATA
The Outcome column is the target column of our data-set. Outcome ’1’ indicates that
the patient has diabetes and ’0’ indicates that the patient does not have diabetes. The
other features in the data-set are provided as input to the machine learning models.
6.2 STRATIFICATION
We split the Pima Indian data-set into two parts, a training data-set and a test
data-set. A plain split is completely random, so the number of instances of each class
label (outcome) that lands in the training or test data-set is also random. The training
data may therefore contain proportionately more instances of having diabetes (class 1)
and fewer instances of not having diabetes (class 0) than the full data-set. During
classification we might then get accurate predictions for class 1 but not for class 0.
So we perform stratification, which keeps the classes in the same proportion in both
the training and the test data.
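A stratified split can be sketched as follows. This is a minimal illustration rather than the exact code of the project: the feature values are random stand-ins, the 25 percent test size is an assumption, and only the class counts (268 positive and 500 negative out of 768) follow the Pima Indian data-set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in features; only the label counts mirror the real data-set.
rng = np.random.RandomState(0)
X = rng.normal(size=(768, 8))
y = np.array([1] * 268 + [0] * 500)

# stratify=y keeps the share of each class the same in both parts.
# test_size=0.25 is an illustrative choice, not taken from the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print("share of class 1 overall :", round(y.mean(), 3))
print("share of class 1 in train:", round(y_train.mean(), 3))
print("share of class 1 in test :", round(y_test.mean(), 3))
```

Without the stratify argument the two shares would vary from one random split to the next.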
6.3 SUPERVISED LEARNING MODEL
We used Logistic Regression, Support Vector Machine, K-nearest Neighbour,
Decision Tree and Random Forest.
6.4 K-FOLD CROSS VALIDATION
The Pima Indian data-set is imbalanced: class zero has more instances than class
one. In such a situation it is beneficial to train and test the model on every
instance of the data-set and then take the average of all the accuracies noted.
In K-fold cross validation we divide the data-set into K subsets. We train the
model on K-1 parts and test it on the remaining part. We repeat this process until
every part has been used once for testing, with training on the other K-1 parts.
After completing the above process, we take the mean of the accuracies and errors to
get an average accuracy of the algorithm. On one subset the algorithm may under-fit
while on another it may over-fit, but averaging over the K folds gives a more general
estimate of the model’s performance.
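The procedure above can be sketched with scikit-learn as follows. The data here is synthetic (generated with make_classification to have 768 rows, 8 features and a roughly 65/35 class balance), and K = 10 and Logistic Regression are illustrative assumptions, so the printed accuracies are not the results of the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes data-set (assumed shape and balance).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)

# K = 10 folds: each fold is used once for testing and nine times for training.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="accuracy"
)

print("accuracy per fold:", np.round(scores, 3))
print("mean accuracy    :", round(scores.mean(), 3))
```

The spread of the per-fold accuracies shows how much a single train/test split could mislead.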
6.5 CALCULATING THE ACCURACY
The accuracy score of every supervised algorithm is calculated first. We then apply
K-fold cross validation to the models and calculate the cross validation score. Here
we can see a considerable improvement in a few models.
6.6 ENSEMBLING
In the ensembling technique we create multiple models and then combine them
to improve results. This method usually produces more accurate results than a
standalone model would. The models used to build the ensemble are called base models.
Ensembling is done via the Voting Ensemble. Voting is a simple way of merging the
predictions from multiple machine learning models. First, two or more standalone
models are created from the training data-set. A Voting Classifier then wraps the
standalone models and takes the mean of the predictions of the sub-models.
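A voting ensemble of this kind might look like the sketch below. The choice of base models, their settings and the synthetic data are assumptions made for illustration. Soft voting is used because it averages the predicted class probabilities of the sub-models, which corresponds to taking the mean of their predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data-set (assumed shape and balance).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base models: the standalone learners wrapped by the ensemble.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
]

# voting="soft" averages the class probabilities of the base models.
ensemble = VotingClassifier(estimators=base_models, voting="soft")
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print("ensemble accuracy:", round(ensemble_accuracy, 3))
```

With voting="hard" the ensemble would instead return the majority class label of the base models.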
6.7 NEURAL NETWORK MODEL
We used Keras to run the neural network, with Theano as the backend. The
Pima Indian Diabetes dataset is relatively small for neural network applications,
which limits the accuracy that can be achieved. We believe the ensemble model above
is a better approach for this dataset.
6.8 OPTIMIZING THE NEURAL NETWORK MODEL
We optimize various parameters of the neural network, such as Batch Size, Epochs,
Learning Rate, Dropout Rate, Activation, Initialization and Number of Neurons in the
Hidden Layer, using the Grid Search method.
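The idea of a grid search can be sketched as below. To keep the example self-contained it uses scikit-learn’s MLPClassifier as a stand-in for the Keras model of the project, so this is not the network we trained. The parameter values, the 3-fold cross validation and the synthetic data are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Small synthetic data-set so that the search finishes quickly.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# Every combination of these values is trained and scored (2 * 2 * 2 = 8).
param_grid = {
    "batch_size": [16, 32],
    "activation": ["relu", "tanh"],
    "hidden_layer_sizes": [(8,), (8, 4)],
}

grid = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

print("best parameters:", grid.best_params_)
print("best mean cross validation accuracy:", round(grid.best_score_, 3))
```

The cost of the search grows with the product of the list lengths, which is why the parameters are often tuned a few at a time.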
6.9 CREATING REPORT ABOUT THE MODEL
We generate the Classification Report, Accuracy Score and Confusion Matrix of the
Neural Network Model.
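The three reports can be produced with scikit-learn as in the following sketch. The ten actual and predicted labels are made up for illustration and are not output of our model.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)

# Hypothetical labels: 1 = diabetic, 0 = not diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print("accuracy:", acc)
# Rows are the actual classes, columns the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(cm)
print(classification_report(y_true, y_pred))
```

Here eight of the ten predictions are correct, with one false positive and one false negative.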

Chapter 7
System Testing
7.1 TESTING NEURAL NETWORK
Figure 7.1: ACTUAL OUTCOME VS PREDICTED OUTCOME FOR THE FIRST 25 ROWS

Chapter 8
Implementation
A. DATA ACQUISITION:
Data acquisition is the method by which data is acquired from various sources
such as the clipboard, CSV, JSON, SQL, HTML (by web scraping), SAS, etc. In our
project we acquire the data from a CSV file named diabetes.csv.
CODE:
import pandas as pd
diab = pd.read_csv('diabetes.csv')
B. DATA CLEANING:
Data cleaning is the technique by which redundant information is removed
from the data. It is not the most interesting task, but it is very important as it can
make or break a machine learning project. Unwanted observations in the data-set
are duplicate and irrelevant information. A data-set might also have missing
data. The most common ways of dealing with this problem are dropping the
observations with missing values or imputing the missing data based on other
observations.
CODE:
import numpy as np
# Inspect the rows where glucose is recorded as 0
diab[diab['Glucose'] == 0]
# In these columns a zero stands for a missing measurement
columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diab[columns] = diab[columns].replace(0, np.nan)
diab.dropna(inplace=True)
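The effect of this cleaning step can be seen on a three-row example. The values below are hypothetical, and only two of the measurement columns are used to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second has no glucose reading, the third no BMI.
diab = pd.DataFrame({
    "Pregnancies": [6, 0, 8],
    "Glucose": [148, 0, 183],
    "BMI": [33.6, 26.6, 0.0],
    "Outcome": [1, 0, 1],
})

# Only columns where zero cannot be a real measurement are treated.
columns = ["Glucose", "BMI"]
diab[columns] = diab[columns].replace(0, np.nan)

# Dropping the incomplete rows leaves only the first row.
cleaned = diab.dropna()
print(cleaned)
```

The zero in the Pregnancies column is a valid value and is left untouched.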
C. DATA PREPROCESSING:
Data preprocessing means transforming the data before feeding it to a
machine learning algorithm. It basically converts the data from its raw format
into clean data. Common techniques include rescaling, binarizing and standardizing
the data. In our project we have standardized the data.

CODE:
from sklearn.preprocessing import StandardScaler
dataset = diab.values  # NumPy array: 8 feature columns followed by Outcome
X = dataset[:, :8]
Y = dataset[:, 8].astype(int)
# Rescale every feature to zero mean and unit variance
scaler = StandardScaler().fit(X)
X_standardized = scaler.transform(X)
Figure 8.1: DESCRIPTION OF THE STANDARDIZED DATA
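Standardization replaces each value x by (x - mean) / standard deviation, computed per column. The sketch below checks this on a small array of hypothetical glucose and BMI values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 = glucose, column 1 = BMI.
X = np.array([
    [85.0, 26.6],
    [183.0, 23.3],
    [89.0, 28.1],
    [137.0, 43.1],
])

scaler = StandardScaler().fit(X)
X_standardized = scaler.transform(X)

# The same result computed by hand, column by column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.round(X_standardized, 3))
print("column means:", np.round(X_standardized.mean(axis=0), 6))
print("column stds :", np.round(X_standardized.std(axis=0), 6))
```

After the transformation every column has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to distance-based models such as KNN and SVM.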
D. EXPLORATORY DATA ANALYSIS: EDA is the technique of analyzing the
data using visual methods such as graphs and charts.
CODE:
import cufflinks as cf  # adds the .iplot() method to pandas DataFrames
diab.iplot(theme='solar')
Figure 8.2: PLOTTING THE DIABETES DATA-SET

CODE:
import itertools
import matplotlib.pyplot as plt
columns = diab.columns[:8]
plt.subplots(figsize=(18, 15))
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot(length // 2, 3, j + 1)  # integer division: subplot needs ints
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    diab[i].hist(bins=20, edgecolor='black')
    plt.title(i)
plt.show()
Figure 8.3: HISTOGRAM OF ALL THE FEATURES IN THE DATA-SET

CODE:
# Keep only the patients who have diabetes
diab1 = diab[diab['Outcome'] == 1]
plt.subplots(figsize=(18, 15))
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot(length // 2, 2, j + 1)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    diab1[i].hist(bins=20, edgecolor='black')
    plt.title(i)
plt.show()
Figure 8.4: HISTOGRAM OF ALL THE FEATURES OF DIABETES PATIENT USING HUE

CODE:
import matplotlib.gridspec as gridspec
import seaborn as sns
plt.figure(figsize=(12, 4 * 4))
gs = gridspec.GridSpec(4, 2)  # 8 features on a 4 x 2 grid
for i, cn in enumerate(diab[columns]):
    ax = plt.subplot(gs[i])
    # distplot is deprecated in recent seaborn releases
    sns.distplot(diab[cn][diab.Outcome == 1], bins=20, color='c', label='Outcome 1')
    sns.distplot(diab[cn][diab.Outcome == 0], bins=20, color='r', label='Outcome 0')
    ax.set_xlabel('')
    ax.legend()
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()
Figure 8.5: HISTOGRAM

CODE:
sns.pairplot(data=diab, hue='Outcome', diag_kind='kde', palette='husl')
plt.show()
Figure 8.6: PAIR PLOT OF ALL THE FEATURES

CODE:
# Correlation between the eight features
sns.heatmap(diab[diab.columns[:8]].corr(), annot=True, linewidths=.5)
fig = plt.gcf()
fig.set_size_inches(8, 6)
plt.show()
Figure 8.7: HEAT MAP
E. CREATING MACHINE LEARNING MODELS, TRAINING THE MODEL

AND PREDICTING THE RESULT:
I. SUPERVISED LEARNING:
Supervised learning algorithms are used for training the model using la-
beled instances such as an input where the desired outcome is known.
The algorithm uses a set of inputs along with the corresponding correct
outputs, and the algorithm learns by comparing its actual output with ex-
pected outputs to find errors and then modifies the model accordingly.
Supervised learning uses patterns to predict the values of unlabeled data
through methods like classification, regression, prediction and gradient
boosting. Supervised learning is commonly used in applications where
historical data predicts likely future events.

a. Logistic Regression:
Figure 8.8: LOGISTIC REGRESSION
b. Support Vector Machine:
Figure 8.9: SUPPORT VECTOR MACHINE
c. K-Nearest Neighbour:
Figure 8.10: K NEAREST NEIGHBOUR
d. Decision Tree:
Figure 8.11: DECISION TREE

e. Random Forest:
Figure 8.12: RANDOM FOREST
f. K-Fold Cross Validation :
Figure 8.13: MEAN CROSS VALIDATION SCORE OF ALL SUPERVISED ALGORITHM
g. Change in Accuracy after K-Fold Cross Validation
Figure 8.14: CHANGE IN ACCURACY OF THE MODELS AFTER CROSS VALIDATION
Figure 8.15: BAR PLOT OF CHANGE IN ACCURACY

h. Plotting the Accuracy of different Supervised Learning Algorithm

box = pd.DataFrame(accuracy, classifiers)
boxT = box.T
boxT.iplot(kind='box', theme='solar')
Figure 8.16: BOX PLOT OF THE ACCURACY OF SUPERVISED LEARNING
II. REINFORCEMENT LEARNING
Reinforcement learning has three primary components: the agent, which
is the learner or the decision maker; the environment the agent interacts
with; and the actions that the agent performs.
The primary objective of the agent is to choose actions that result in a
maximized output over a given amount of time. The agent will reach the
goal much faster by following a good policy, thus its goal is to learn
the best policy. The neural network used in this project is, however,
trained on labelled examples, in the same way as the supervised models above.
a. Neural Network:
An Artificial Neural Network (ANN) is an information processing model
that is inspired by the way our biological nervous systems, such as
the brain, process information. The key element of this paradigm is
the novel structure of the information processing system. It is com-
posed of a large number of highly interconnected processing elements
(neurones) working in unison to solve specific problems. ANNs, like
people, learn by example. An ANN is configured for a specific ap-
plication, such as pattern recognition or data classification, through a
learning process. Learning in biological systems involves adjustments

to the synaptic connections that exist between the neurones. This is

true of ANNs as well.
i. Creating the Neural Network Model:
Figure 8.17: NEURAL NETWORK MODEL
ii. Optimizing the Batch Size and Epoch of the Neural Network
Model:
Figure 8.18: BATCH SIZE AND EPOCH
Figure 8.19: BEST BATCH SIZE AND EPOCH

iii. Optimizing the Learning Rate and Dropout of the Neural Network
Model:
Figure 8.20: LEARNING RATE AND DROPOUT RATE
Figure 8.21: BEST LEARNING RATE AND DROPOUT RATE

iv. Optimizing the Initialization and Activation of the Neural Net-

work Model:
Figure 8.22: ACTIVATION AND INITIALIZATION
Figure 8.23: BEST ACTIVATION AND INITIALIZATION

v. Optimizing the Number of Neuron of the Neural Network Model:
Figure 8.24: NEURONS IN EACH LAYER
Figure 8.25: BEST NEURONS IN EACH LAYER
vi. Predicting the outcome using test Data:

y_pred = grid.predict(X_standardized)
vii. Accuracy Score, Classification Report and Confusion Matrix:
Figure 8.26: ACCURACY SCORE, CLASSIFICATION REPORT AND CONFUSION MATRIX

Chapter 9
Result of Project
9.1 SUPERVISED LEARNING
Figure 9.1: BAR PLOT SHOWING THE ACCURACY OF THE SUPERVISED MODELS
Figure 9.2: BOX PLOT OF THE ACCURACY OF SUPERVISED LEARNING

9.2 REINFORCEMENT LEARNING
Figure 9.3: ACCURACY SCORE AND CLASSIFICATION REPORT FOR NEURAL NETWORK
Figure 9.4: CONFUSION MATRIX FOR THE NEURAL NETWORK

Chapter 10
Conclusion
In this project, our aim was to diagnose diabetes with maximum accuracy using
several supervised learning algorithms. Logistic Regression, KNN, Random Forest,
Decision Tree, Support Vector Machine and an Artificial Neural Network were
applied, and their results were compared to find the best one. The Neural
Network and Random Forest were found to be the most accurate algorithms
for the prediction of diabetes in a real-case scenario. The best accuracy obtained
by the system is 81.88 percent.

Chapter 11
Future Scope
A. Ensembling for Supervised Learning: In the ensembling technique we create
multiple models and then combine them to improve results. This method
usually produces more accurate results than a standalone model would.
The models used to build the ensemble are called base models. Ensembling
is done via the Voting Ensemble. Voting is a simple way of merging the
predictions from multiple machine learning models. First, two or more
standalone models are created from the training data-set. A Voting Classifier
then wraps the standalone models and takes the mean of the predictions of the
sub-models. An ensemble learner could therefore be used to increase the recall
for diabetic patients. We tried ensemble learning with Logistic Regression
and Linear SVM, but the overall accuracy decreased. We will pursue this as
future work.
Figure 11.1: ENSEMBLE LEARNING

B. Parallel Processing for Neural Network:

While optimizing the parameters of our neural network using Grid Search, we
ran the search with concurrent processing. In future we would like to run it
with parallel processing to decrease the training time.
Figure 11.2: PARALLEL PROCESS THE ABOVE TASK
