ON
Submitted to
KIIT, Deemed to be University
BACHELOR’S DEGREE IN
COMPUTER ENGINEERING
BY
CERTIFICATE
This is to certify that the project entitled
“A COMPARATIVE STUDY ON PREDICTION AND
DIAGNOSIS OF DIABETES”
submitted by
is a record of bonafide work carried out by them, in partial fulfilment of the
requirement for the award of the Degree of Bachelor of Engineering in Computer
Science at Kalinga Institute of Industrial Technology (KIIT), Deemed to be University,
Bhubaneswar. This work was done during the year 2017-2018, under our guidance.
Date: 06 / 04 / 2018
Apart from our efforts, the success of this project depends largely on the
encouragement and guidelines of many others. We take this opportunity to express our
gratitude to the people who have been instrumental in the successful completion of
this project. We take immense pleasure in thanking and warmly acknowledging the
continuous encouragement, invaluable supervision, timely suggestions and
inspired guidance offered by our project mentor Prof. Harish Kumar Pattnaik, School
of Computer Science and Engineering, Kalinga Institute of Industrial Technology
(KIIT), Deemed to be University, in bringing this report to a successful completion.
We are grateful to Dr S. Mishra, Dean of the School of Computer Science and
Engineering, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, for
permitting us to make use of the facilities available in the department to carry out
the project successfully. We also express our sincere thanks to all our friends who
have patiently extended all sorts of help for accomplishing this undertaking. Finally,
we extend our gratitude to one and all who were directly or indirectly involved in
the successful completion of this project work.
Diabetes is one of the most pervasive diseases in the world. It can lead to heart attack,
blindness, kidney disease, etc. The patient needs to visit a diagnostic center, consult
a doctor and wait for the reports. Moreover, every time one wants a diagnosis
report, one has to spend money. Early prediction of the disease allows patients
to be treated before it becomes critical.
Keywords:
Diabetes, Machine Learning, Diagnosis, SVM, Neural Network, KNN, Decision
Tree.
Contents
1 Introduction
1.1 MACHINE LEARNING
1.1.1 BRIEF
1.1.2 APPLICATION OF MACHINE LEARNING
1.1.3 PURPOSE
1.1.4 SCOPE
2 Literature Survey
3 PROBLEM DEFINITION
4 Project Planning
4.1 Gantt chart
5 Requirement Analysis
5.1 TECHNOLOGY REQUIREMENTS
5.2 DATA REQUIREMENTS
6 System Design
6.1 EXPLORING THE DATA
6.2 STRATIFICATION
6.3 SUPERVISED LEARNING MODEL
6.4 K-FOLD CROSS VALIDATION
6.5 CALCULATING THE ACCURACY
6.6 ENSEMBLING
6.7 NEURAL NETWORK MODEL
6.8 OPTIMIZING THE NEURAL NETWORK MODEL
6.9 CREATING REPORT ABOUT THE MODEL
7 System Testing
7.1 TESTING NEURAL NETWORK
8 Implementation
9 Result of Project
9.1 SUPERVISED LEARNING
9.2 REINFORCEMENT LEARNING
10 Conclusion
11 Future Scope
References
Chapter 1
Introduction
1.1.1 BRIEF
In our project we have used various supervised learning models and a neural
network model on the Pima Indians Diabetes data-set to predict whether a patient has
diabetes or not. We made a comparative analysis of all the algorithms by calculating
their accuracy score, classification report and confusion matrix.
a. Fraud detection.
e. Recommendation Engines.
f. Customer Segmentation.
j. Financial Modeling
1.1.3 PURPOSE
1.1.4 SCOPE
Across the globe, there is a need for regional studies on diabetes prediction.
If diabetes is diagnosed early, the results can help reduce the number of
treatments a patient would otherwise have to undergo after a late diagnosis.
The patient can control his or her diet accordingly, and appropriate measures can be
taken beforehand. This project presents an analysis of several supervised learning
algorithms, namely KNN, Logistic Regression, Random Forest and Decision Tree.
An artificial neural network and SVM have also been used. The analysis finds
Random Forest and the neural network to be the most accurate, with accuracies of
79 percent and 82 percent respectively.
Chapter 2
Literature Survey
Diabetes mellitus is a long-lasting metabolic disorder characterized mainly by
high blood sugar. In this disorder, the body cannot make use of insulin, as a result
of which the blood glucose level increases. Studies have shown that diabetes is one
of the most pervasive disorders and that the number of diabetic patients is increasing
by 12 per cent every year. The cells of the body do not respond to the action of
insulin, and the insulin secreted by the pancreas is not enough to overcome this
resistance. Sugar starts accumulating in the bloodstream while the cells are deprived
of it, which in turn leads to a loss of energy. This project deals with the comparative
and predictive analysis of diabetes using machine learning concepts and algorithms.
Using machine learning concepts, the model has been created from a predefined
data-set containing insulin level, BMI, skin thickness, diastolic blood pressure and
so on. The data-set is split into training and testing data-sets. The training inputs
and outputs are used to fit the model. The testing inputs are then used to predict
outcomes, which are compared with the testing outputs. The following supervised
machine learning models have been used for the comparative study:
a. Regression: A statistical technique that measures the strength of the
relationship between one dependent variable and a number of other
changing variables.
d. Decision Tree and Random Forest: Decision trees are a type of model used for
both classification and regression. A tree asks a sequence of questions, and each
answer sends us down a certain branch of the tree. The model behaves like a series
of "if this, then that" conditions that ultimately yield a specific result.
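The "if this, then that" behaviour of a decision tree can be made visible by printing the rules a fitted tree has learned. The following is a minimal sketch using scikit-learn; the synthetic data and the feature names f0 to f3 are assumptions made for illustration and are not the Pima data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for illustration, not the Pima data-set)
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# A shallow tree keeps the printed rule set short and readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed line is one question (threshold test) or one final class
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented level in the printed output corresponds to one question in the sequence, and each leaf line names the class that is finally predicted.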
The concepts of neural networks have also been used for this project. Keras is a
high-level neural network API, available in Python, that can run
on top of TensorFlow, CNTK, or Theano. Here, we ran Keras on
Theano. It was developed with a focus on enabling fast experimentation.
Being able to go from idea to result with the least possible delay
is key to doing good research.
Using these models, a comparative study has been done, on the basis of which
the model with the highest accuracy can be used for the prediction of diabetes. A deep
study has been done on the various models, and the efficiency of each model has been
compared.
Chapter 3
PROBLEM DEFINITION
Conventionally, doctors examine the reports of different medical tests and
then conclude whether the person has diabetes or not. In doing so, a doctor may
overlook some of the parameters and reach a wrong conclusion. The proposed system
will help a doctor predict more accurately, based on its computational analysis of
previous data-sets. Our goal is to build a system that can predict and diagnose
diabetes using previous data-sets and the test results of patients, through
different machine learning techniques.
Chapter 4
Project Planning
A Gantt chart is a type of bar chart that illustrates a project schedule.
The chart lists the tasks to be performed on the vertical axis and intervals of
time on the horizontal axis. The width of each horizontal bar shows the
duration of the corresponding activity. Gantt charts show the start and finish
dates of the terminal elements as well as the summary elements of a project.
Chapter 5
Requirement Analysis
a. Anaconda Distribution
b. Python 3.5
c. Jupyter Notebook
d. Scikit-learn
e. Pandas
f. Matplotlib
g. Seaborn
h. Keras
i. NumPy
The data requirements of the project are fulfilled by the Pima Indian Diabetes dataset.
It contains 768 entries, each with eight features and one output column
that takes the values 0 and 1.
The features include:
a. No of times pregnant
g. Age(Years)
Chapter 6
System Design
In this project, the Pima Indian diabetes data-set is chosen for exploratory data
analysis (EDA) and for building machine learning models to predict whether a person
has diabetes, based on number of times pregnant, plasma glucose concentration in an
oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold
thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (kg/m2), diabetes
pedigree function and age (years).
The Outcome column is the target column of our data-set. Outcome '1' indicates that the
patient has diabetes and '0' indicates that the patient does not have diabetes. The other
features in the data-set are provided as input to the machine learning models.
6.2 STRATIFICATION
We split the Pima Indian data-set into two parts: a training data-set and a
test data-set. This split is completely random, so the number of instances of each
class label (outcome) in the training and test data-sets is also random. Hence we may
have more instances of diabetes (class 1) in the training data and fewer instances
in the test data.
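A stratified split keeps the class proportions the same in both parts. The sketch below illustrates the idea with scikit-learn's train_test_split and its stratify argument; the synthetic features and the 200/100 class counts are assumptions chosen for illustration and are not the report's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption): 200 of class 0, 100 of class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = np.array([0] * 200 + [1] * 100)

# stratify=y asks for the same class proportions in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fraction of class 1 in each part; with stratification these should match closely
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Without the stratify argument the two fractions can drift apart from one random split to the next, which is the problem described above.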
The Pima Indian data-set is imbalanced: class 0 has more instances
than class 1. In such a situation, it is beneficial to train and test the model on
every instance of the data-set. In K-fold cross validation, we divide
the data-set into K subsets. We then train the model on K-1 parts
and test it on the remaining part. We repeat this process until every part has
been used once for testing, with training on the remaining K-1 parts. After completing
this process, we take the mean of the accuracies and errors to get an average accuracy of
the algorithm. For one subset the algorithm may under-fit, while for another
it may over-fit, but K-fold cross validation gives a more reliable estimate of how
well the model generalizes.
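The procedure above can be sketched with scikit-learn's KFold and cross_val_score. This is only an illustration: the synthetic data, the choice of logistic regression as the model and K = 10 are assumptions, not values taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, mildly imbalanced stand-in data (assumption, not the Pima data-set)
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=0)

# K = 10 folds: train on 9 parts, test on the remaining part, repeat 10 times
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

# The mean over the folds is the cross validation score of the model
mean_accuracy = scores.mean()
print(scores.round(3), round(mean_accuracy, 3))
```

The spread of the per-fold scores shows how much a single train/test split could mislead, which is why the mean is reported.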
The accuracy score of every supervised algorithm is calculated first. We then apply
K-fold cross validation to the models and calculate the cross validation score. Here, we
see a considerable improvement in a few models.
6.6 ENSEMBLING
In the ensembling technique we create multiple models and then combine them
to produce improved results. This method usually produces more accurate results than
a standalone model would. The models used to create the ensemble are called base
models. Ensembling is done here via a voting ensemble. Voting is one of the simplest
ways of combining the predictions from multiple machine learning models. First, two or
more standalone models are created from the training data-set. A Voting Classifier
then wraps the standalone models and averages the predictions of the sub-models.
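A voting ensemble of this kind could look like the sketch below, which uses scikit-learn's VotingClassifier with soft voting so that the predicted class probabilities of the sub-models are averaged. The three base models, their settings and the synthetic data are assumptions for illustration; the report does not state which base models were combined.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the Pima data-set)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Standalone base models (hypothetical selection for this sketch)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# voting="soft" averages the class probabilities predicted by the base models
ensemble = VotingClassifier(estimators=estimators, voting="soft")
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)
print(round(ensemble_accuracy, 3))
```

With voting="hard" the ensemble would instead take a majority vote over the predicted class labels.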
We used Keras to run the neural network, with Theano as the backend. The
Pima Indian Diabetes dataset is relatively small for neural network applications,
which makes it hard to reach a better accuracy. We believe the ensemble model above
is a better approach for this dataset.
We optimize various parameters of the neural network, such as batch size, number of
epochs, activation function, weight initialization and number of neurons in the
hidden layer, using the grid search method.
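Grid search tries every combination of the listed parameter values and keeps the one with the best cross validation score. The sketch below illustrates the idea with scikit-learn's GridSearchCV. Note the substitution: it tunes scikit-learn's MLPClassifier rather than the Keras model used in the project, so that the example stays self-contained. The parameter values and the synthetic data are assumptions, and weight initialization is left out because MLPClassifier does not expose it as a parameter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption, not the Pima data-set)
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Candidate values (hypothetical): 2 x 2 x 2 = 8 combinations are evaluated
param_grid = {
    "hidden_layer_sizes": [(4,), (8,)],   # neurons in the single hidden layer
    "activation": ["relu", "tanh"],       # activation function
    "batch_size": [10, 20],               # mini-batch size
}

# max_iter caps the number of training epochs for each candidate network
search = GridSearchCV(MLPClassifier(max_iter=100, random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The same pattern applies to a Keras model once it is wrapped in a scikit-learn compatible estimator; the grid simply grows with every parameter that is added.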
We generate the Classification Report, Accuracy Score and Confusion Matrix of the
Neural Network Model.
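The three reports can be produced with scikit-learn's metrics module. The sketch below uses a small hand-made pair of label lists, which are assumptions for illustration and not results from the project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-made example labels (assumption): 1 = diabetic, 0 = not diabetic
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)

# Precision, recall and F1 score per class
report = classification_report(y_true, y_pred)

print(accuracy)
print(matrix)
print(report)
```

Here six of the eight predictions are correct, so the accuracy is 0.75, and the confusion matrix shows one false positive and one false negative.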
Chapter 7
System Testing
Figure 7.1: ACTUAL OUTCOME VS PREDICTED OUTCOME FOR THE FIRST 25 ROWS
Chapter 8
Implementation
A. DATA ACQUISITION:
Data acquisition is the method by which data is acquired from various sources
such as the clipboard, CSV, JSON, SQL, HTML (by web scraping), SAS, etc. In our
project we acquire data from a CSV file named diabetes.csv.
CODE:
import pandas as pd
diab = pd.read_csv('diabetes.csv')
B. DATA CLEANING:
Data cleaning is the set of techniques by which redundant information is removed
from the data. It is not the most interesting task, but it is very important, as it can
make or break a machine learning project. Unwanted observations in the data-set
include duplicate and irrelevant information. A data-set may also have missing
data. The most common ways of dealing with this problem
are dropping the observations with missing values or imputing the missing data based
on other observations.
CODE:
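import numpy as np

# Inspect the rows where Glucose is recorded as 0
diab[diab['Glucose'] == 0]
# Columns in which a value of 0 stands for a missing measurement
columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Mark those zeros as missing, then drop the incomplete rows
diab[columns] = diab[columns].replace(0, np.nan)
diab.dropna(inplace=True)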
C. DATA PREPROCESSING:
Data preprocessing means transforming the data before feeding it to a machine
learning algorithm. It converts the data from its raw format
into clean data. Common options include rescaling, binarizing and standardizing
the data. In our project we have standardized the data.
CODE:
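from sklearn.preprocessing import StandardScaler

# 'dataset' is assumed to be the cleaned data as a NumPy array
dataset = diab.values
X = dataset[:, :8]             # the eight input features
Y = dataset[:, 8].astype(int)  # the Outcome column
# Standardize each feature to zero mean and unit variance
scaler = StandardScaler().fit(X)
X_standardized = scaler.transform(X)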
CODE:
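import itertools
import matplotlib.pyplot as plt

# Interactive plot of every column (requires cufflinks, which adds .iplot to DataFrames)
diab.iplot(theme='solar')

# Histogram of each of the eight features
columns = diab.columns[:8]
plt.subplots(figsize=(18, 15))
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot(length // 2, 3, j + 1)  # integer grid size: 4 rows x 3 columns
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    diab[i].hist(bins=20, edgecolor='black')
    plt.title(i)
plt.show()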
CODE:
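import itertools
import matplotlib.pyplot as plt

# Keep only the patients who have diabetes (Outcome == 1)
diab1 = diab[diab['Outcome'] == 1]
columns = diab.columns[:8]
plt.subplots(figsize=(18, 15))
length = len(columns)
for i, j in itertools.zip_longest(columns, range(length)):
    plt.subplot(length // 2, 2, j + 1)  # integer grid size: 4 rows x 2 columns
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    diab1[i].hist(bins=20, edgecolor='black')
    plt.title(i)
plt.show()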
Figure 8.4: HISTOGRAM OF ALL THE FEATURES OF DIABETES PATIENT USING HUE
CODE:
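import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

columns = diab.columns[:8]
plt.figure(figsize=(12, 28 * 4))
gs = gridspec.GridSpec(28, 2)
for i, cn in enumerate(diab[columns]):
    ax = plt.subplot(gs[i])
    # Overlay the distribution of the feature for both outcomes.
    # Note: sns.distplot is deprecated in newer seaborn releases (histplot is its successor).
    sns.distplot(diab[cn][diab.Outcome == 1], bins=20, color='c', label='Outcome 1')
    sns.distplot(diab[cn][diab.Outcome == 0], bins=20, color='r', label='Outcome 0')
    ax.set_xlabel('')
    ax.legend()  # uses the labels given above
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()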
CODE:
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(data=diab, hue='Outcome', diag_kind='kde', palette='husl')
plt.show()
CODE:
# Correlation matrix of the eight features, annotated with the coefficients
sns.heatmap(diab[diab.columns[:8]].corr(), annot=True, linewidths=.5)
fig = plt.gcf()
fig.set_size_inches(8, 6)
plt.show()
I. SUPERVISED LEARNING:
Supervised learning algorithms train a model using labeled examples,
that is, inputs for which the desired output is known.
The algorithm receives a set of inputs along with the corresponding correct
outputs, and learns by comparing its actual output with the expected
output to find errors; it then modifies the model accordingly.
Supervised learning uses patterns to predict the label values of unlabeled data
through methods such as classification, regression, prediction and gradient
boosting. It is commonly used in applications where
historical data predicts likely future events.
a. Logistic Regression:
c. K-Nearest Neighbour:
d. Decision Tree:
e. Random Forest:
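A comparison across the supervised models listed above can be sketched as a loop that fits each model on the same training split and records its test accuracy. This is an illustration only: the synthetic data, the default hyperparameters and the inclusion of SVC are assumptions, and the printed numbers are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the Pima data-set)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Models under comparison (hypothetical settings)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train every model on the same split and record its test accuracy
accuracies = {name: model.fit(X_train, y_train).score(X_test, y_test)
              for name, model in models.items()}

for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Using one shared split and one shared scaler keeps the comparison fair, since every model sees exactly the same training and test instances.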
ii. Optimizing the Batch Size and Epoch of the Neural Network
Model:
iii. Optimizing the Learning Rate and Dropout of the Neural Network
Model:
Chapter 9
Result of Project
Figure 9.1: BAR PLOT SHOWING THE ACCURACY OF THE SUPERVISED MODELS
Figure 9.3: ACCURACY SCORE AND CLASSIFICATION REPORT FOR NEURAL NETWORK
Chapter 10
Conclusion
In this project, our aim was to diagnose diabetes with maximum accuracy using
several supervised learning algorithms of machine learning. Logistic Regression,
KNN, Random Forest and Decision Tree, along with an artificial neural network and
support vector machines, were used in this process so that their results could be
compared and the best one identified. The neural network and Random Forest were
found to be the most accurate algorithms for the prediction of diabetes in a real
case scenario. The accuracy of the system was found to be 81.88 percent.
Chapter 11
Future Scope