IDL - International Digital Library Of

Technology & Research


Volume 1, Issue 5, May 2017 Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Prediction of heart disease using classification mining technique on Spark
Rashmi G Saboji
Computer Science & Engineering
C.M.R Institute of Technology
Bangalore, India
rashmikaneri@gmail.com

Abstract: This paper considers the growing volume of health care data that is accumulated digitally every day. The healthcare industry is becoming very data intensive. Worldwide digital healthcare data is estimated at 500 petabytes (10^15 bytes) and is expected to reach 25 exabytes (10^18 bytes) in 2020 [6]. In this paper, heart disease is selected from among the variety of diseases in healthcare. The purpose of this work is to predict the diagnosis of heart disease with a reduced number of attributes. Each dataset stored in HDFS is classified based on its attributes. This prediction solution, using random forest on Apache Spark, gives health care analysts the opportunity to deploy it on an ever-changing, scalable big data landscape for insightful decision making.

Keywords: Spark, HDFS, Heart disease, Random forest, verification

1. INTRODUCTION
The health care system is rapidly adopting electronic health records (EHR), which will drastically increase the quantity of clinical data available digitally. Concurrently, fast progress is being made in clinical analytics, such as techniques for analyzing large volumes of data and deriving new insights from that analysis, which is known as big data analytics. As a result, we can use the opportunities provided by big data to reduce the cost of health care as well as to diagnose diseases. In this paper, heart disease is selected from among the variety of diseases in healthcare. Heart disease is a general name for a variety of diseases. Its symptoms may vary depending on the specific type of heart disease.

Hospitals use hospital database systems to store and manage their patient data. These systems generate large volumes of data, but these data are rarely used to support insightful clinical decision making.

Using big data with data mining algorithms makes it possible to identify healthcare trends, prevent diseases, diagnose diseases, and so on.

2. OBJECTIVES
The purpose of this work is to predict the diagnosis of heart disease with a reduced number of attributes. Each dataset stored in HDFS is classified based on its attributes. Here thirteen attributes and one class are involved in predicting heart disease. This prediction solution, using random forest on Apache Spark, gives health care analysts the opportunity to deploy it on an ever-changing, scalable big data landscape for insightful decision making.
The scope of this project mainly concerns the data analysis part, with improvements in the four issues below:

1. Complexity of the analysis: For some analysis algorithms, the computing time increases dramatically even with small amounts of data growth.
2. Accuracy in prediction: Different data mining algorithms in classification, clustering, regression and association have different accuracy when it comes to prediction.
3. Scale of the data: Even for simple data analysis, it could take several days, even months, to obtain the result when the data is very large (e.g. zettabyte scale).
4. Parallelization of the computing model: For computationally intensive problems, we can parallelize the analysis so that the problem can be solved by distributing tasks over many computers.

The main objective of this work is to identify the key patterns or features from the medical data using the classifier model. The attributes that are more relevant to heart disease diagnosis can be observed. This will help medical practitioners to understand the root causes of the disease in depth.

3. METHODOLOGY
We evaluate heart disease prediction and its accuracy on the Spark and Hadoop platform because the attribute datasets are likely to scale out across several machines. The whole solution is represented at an abstract architecture level in Fig 3.1. First, the Spark ecosystem needs to be understood in order to take advantage of its functionality and its support for machine learning libraries. Fig 3.2 shows the Spark ecosystem, including its underlying resource manager, YARN, and its distributed file system, HDFS. The next task is to collect heart disease datasets in CSV files. These datasets need to be processed, which means that each dataset is labeled with a class.

Fig 3.1: System Architecture

Fig 3.2: Spark Eco System
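
As an illustration of the labeling step, the following minimal Python sketch parses CSV text into thirteen attribute values and one binary class per record. It runs on plain Python rather than on Spark, so it is not the pipeline used in this work. The column names follow the UCI heart disease documentation and are an assumption, as are the helper names `parse_record` and `load_records` and the use of `?` as the missing-value marker. The sample rows are made up for illustration.

```python
import csv
import io

# Assumed column names (UCI "processed" heart disease layout); the paper
# itself does not list its thirteen attributes.
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]


def parse_record(fields):
    """Turn one CSV row (13 attributes + num) into (features, label)."""
    if len(fields) != len(FEATURES) + 1:
        raise ValueError(
            f"expected {len(FEATURES) + 1} fields, got {len(fields)}")
    # "?" is treated as a missing value and kept as None for later handling.
    features = [None if f.strip() == "?" else float(f) for f in fields[:-1]]
    # Collapse num (0..4) into two classes: 0 = absent, 1 = present.
    label = 1 if float(fields[-1]) > 0 else 0
    return features, label


def load_records(text):
    """Parse CSV text into a list of (features, label), skipping blank lines."""
    reader = csv.reader(io.StringIO(text))
    return [parse_record(row) for row in reader if row]


# Illustrative rows in the assumed layout (not data from the paper).
sample = (
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,3,3,2\n"
    "53,0,3,128,216,0,2,115,0,0.0,1,0,?,0\n"
)
records = load_records(sample)
for features, label in records:
    print(label, features)
```

Keeping missing entries as `None` at this stage leaves the choice of fill strategy to a later step.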

The class is simply a numerical representation of the heart disease prediction based on attribute values. Class 0 means absence of heart disease and class 1 means presence of heart disease. Once the data is collected in CSV, it is stored in HDFS, as it provides great fault tolerance. The data is then extracted and parsed in order to handle the missing values of attributes. Finally, random forest is used to predict the class of newly arrived, unlabeled datasets. The same algorithm is used to measure its accuracy as the training dataset grows, which addresses the issue of scalability on big data. We also check the computation time of the algorithm on Spark to address the issue of computational complexity. The results show that the error rate decreases, and accuracy increases, as the training dataset grows.

4. IMPLEMENTATION
The heart disease datasets are collected from the source given below [12]:
The UCI machine learning repository is a widely used repository which contains different datasets from different locations. These datasets are used for data mining and machine learning purposes. For heart disease prediction, data is collected from Cleveland, Switzerland and Hungary. Further information about the sources and attributes is given below.
1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

The "num" attribute indicates the presence of heart disease in the patient. Its value ranges from 0 (no presence) to 4. Most of the experiments with the Cleveland database focus on distinguishing absence (num value 0) from presence (num values 1 to 4). For our experiments, we use two classes for prediction: 0 for absent and 1 for present. For privacy reasons, patients' personal identification information has been replaced with dummy values.

For Prediction and Accuracy:
1. Datasets are extracted from HDFS.
2. Datasets are parsed to fill in missing values to provide complete supervised datasets.
3. The datasets are then divided into training and testing datasets in a 70:30 proportion.
4. The training dataset is used to train the random forest model with optimized parameters as explained above.
5. The model is evaluated on the test instances and the test error is computed.
6. The model's output on the test instances gives the prediction.
7. Accuracy is evaluated by comparing the known label of each test instance with the value predicted by the algorithm.

5. OUTCOMES
The graph below shows the outcome of the random forest implementation on Spark. The same prediction model is built using Naive Bayes, but the figure below clearly shows that the Naive Bayes prediction accuracy does not reach the accuracy level of random forest. Fig 5.1 depicts the difference in accuracy between random forest and Naive Bayes with respect to the increase in training datasets, which are stored in HDFS.
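
The preparation and evaluation steps listed under "For Prediction and Accuracy" can be sketched in plain Python without a cluster. This is only an illustration under stated assumptions, not the Spark job used in this work. The paper does not say how missing values are filled, so the column mean is used here as an assumed strategy. The helper names (`fill_missing_with_mean`, `split_train_test`, `accuracy`, `error_rate`) are hypothetical, and the labels and predictions at the end are made-up values standing in for the output of a trained random forest.

```python
import random


def fill_missing_with_mean(rows):
    """Replace None entries with the mean of the observed values per column.

    Mean imputation is an assumption; the paper does not name its strategy.
    """
    columns = list(zip(*rows))
    means = []
    for column in columns:
        observed = [v for v in column if v is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[i] if v is None else v for i, v in enumerate(row)]
            for row in rows]


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of test instances whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def error_rate(true_labels, predicted_labels):
    """Test error, defined here as one minus accuracy."""
    return 1.0 - accuracy(true_labels, predicted_labels)


# Made-up inputs for illustration only.
filled = fill_missing_with_mean([[1.0, None], [3.0, 4.0]])
rows = list(range(10))
train_rows, test_rows = split_train_test(rows)
true = [0, 1, 1, 0]
pred = [0, 1, 0, 0]
print(filled, len(train_rows), len(test_rows))
print(accuracy(true, pred), error_rate(true, pred))
```

Because the error rate is defined as one minus accuracy, the two measures necessarily move in opposite directions as the training set grows.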

Fig 5.2 shows the increase in accuracy as the training datasets used for prediction increase.

Fig 5.1: Accuracy comparison chart (Accuracy Performance Comparison of Random Forest and Naive Bayes; horizontal axis values 200, 400 and 600)

Fig 5.2: Accuracy graph of Random Forest (accuracy versus number of records; plotted points: 200 records, 88%; 400 records, 96%; 600 records, 98%)

CONCLUSION AND FUTURE ENHANCEMENT
Using big data analytics, the healthcare data generated over time in the medical field can be processed faster for predicting diseases without any overhead. As the data grows, there will not be any decrease in the performance of the system. This experimental outcome on heart disease attributes validates the prediction accuracy and computation time at colossal scale with respect to scalable, ever-growing data.
In the future, it can be validated on a larger supervised dataset at colossal scale, running on a cluster setup with even more optimized algorithm parameters. It can also be validated on unsupervised datasets to check prediction accuracy.

REFERENCES

[1] Mu-Hsing Kuo, Dillon Chrimes, Belaid Moa, Wei Hu, "Design and Construction of a Big Data Analytics Framework for Health Applications", 2015 IEEE International Conference on Smart City/SocialCom/SustainCom together with DataCom 2015 and SC2 2015.

[2] K. Rajalakshmi and K. Nirmala, "Heart Disease Prediction with MapReduce by using Weighted Association Classifier and K-Means", Indian Journal of Science and Technology, Vol 9(19), DOI: 10.17485/ijst/2016/v9i19/93827, May 2016.

[3] Purushottam, Prof. (Dr.) Kanak Saxena, Richa Sharma, "Efficient Heart Disease Prediction System using Decision Tree", International Conference on Computing, Communication and Automation (ICCCA2015).

[4] Jian Fu, Junwei Sun, Kaiyuan Wang, "SPARK - A Big Data Processing Platform for Machine Learning", 2016 International Conference on Industrial Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration.

[5] Patil R Priya, Kinariwala A S, "Automated Diagnosis of Heart Disease using Random Forest Algorithm", International Journal of Advance Research, Ideas and Innovations in Technology.

[6] Sun J., Reddy C.K., "Big Data Analytics for Healthcare", tutorial presentation at the SIAM International Conference on Data Mining, Austin, TX, 2013.

[7] Hughes G., "How big is 'Big Data' in healthcare?", URL: http://blogs.sas.com/content/hls/2011/10/21/how-big-is-big-data-inhealthcare/ [accessed 2014-9-26].

[8] Herland et al., "A review of data mining using big data in health informatics", Journal of Big Data 2014, 1:2.

[9] https://hortonworks.com/apache/hdfs/

[10] https://spark.apache.org/

[11] http://data-flair.training/blogs/hadoop-mapreduce-vs-apache-spark/

[12] https://archive.ics.uci.edu/ml/datasets/Heart+Disease

[13] https://www.stat.berkeley.edu/~breiman/RandomForests/

[14] Ankush Verma, Ashik Hussain Mansuri, and Dr. Neelesh Jain, "Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison", 2016 Symposium on Colossal Data Analysis and Networking (CDAN).

[15] Amit Nandi, Spark for Python Developers.
