ABSTRACT Various classification methods are applied in the medical field to predict diseases such as
diabetes and tuberculosis. Diabetes can be diagnosed by comparing a patient's blood sugar level with the
known normal levels, together with blood pressure, BMI, skin thickness, and so on. Several classification
methods have been applied to diabetes data. In this paper, the main aim is to build a statistical model for
diabetes data that achieves better classification accuracy. Features extracted from the diabetes data are
projected to a new space using principal component analysis, and a linear regression method is then applied
to these newly formed attributes. The accuracy obtained by this method for predicting diabetes is 82.1%,
which improves on other existing classification methods.
I. INTRODUCTION
Analysis of diseases is a difficult task in the medical field. Diabetes [2] can be diagnosed by comparing
the blood sugar level with the normal desired level. In this manuscript we present a statistical diabetes
prediction model in which Principal Component Analysis (PCA) is applied to project the attributes of the
Pima Indian Diabetes Data (PIDD) to a new feature space. These attributes are then modeled using a linear
regression model [8] to predict diabetes.
Attributes of PIDD are inspected from different angles to obtain the information required for processing
the data, so feature extraction is a major step in examining PIDD. The work concentrates on mapping feature
values from PIDD to a new feature space by employing the PCA method. The new feature values are inspected
for their importance and relevance, and are then subjected to data mining methods such as LRM to classify
the given data for predicting diabetes.
PCA is a dimensionality reduction method that treats the PIDD as a set of rows representing characteristics
in a high-dimensional space and projects all rows onto the directions that represent the best set of
features. For the original set of PIDD attributes, this transformation yields an axis, the principal
eigenvector, along which the points of all observations of each feature are most spread out; the variance
of the data is maximized along this axis. The second eigenvector is the axis along which the variance of
the distance from the first axis is greatest, and so on. The matrix of points is represented by a small
number of eigenvectors, and the data are approximated so as to minimize the root mean square error for the
given number of columns in the matrix considered. Thus the original features of PIDD are approximated with
fewer dimensions that summarize the original PIDD.
Model building aims to provide a good fit to a set of data. A linear statistical model estimates the
unknown dependent PIDD feature value from the known independent PIDD feature values. The representation of
the relationship between the dependent PIDD feature and a set of multiple independent PIDD features is
known as regression analysis.
The rest of the paper is organized as follows: Section 2 describes previous work on PIDD, Section 3
explains the proposed methodology, Section 4 discusses the findings, and Section 5 concludes the work.

II. RELATED WORK
Polat et al. [1] proposed a Least Square Support Vector Machine (LS-SVM) classification method that
obtained an accuracy of 79.16%, an improvement over previous classification methods. Generalized
Discriminant Analysis (GDA) was used at the preprocessing stage to discriminate the variables of PIDD, and
the LS-SVM technique was then applied to these variables to classify the disease.

The associate editor coordinating the review of this manuscript and approving it for publication was
Yue Zhang.
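
The pipeline outlined in the Introduction (project the features with PCA, then fit a linear model on the projected attributes) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic rather than PIDD, the number of retained components (4) and the 0.5 cutoff are chosen for illustration, and the helper names pca_fit, pca_transform, fit_linear_model and predict_labels are hypothetical.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the eigenvectors of the covariance matrix of X,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the principal directions."""
    return (X - mean) @ components.T


def fit_linear_model(Z, y):
    """Least-squares coefficients [b0, b1, ..., bk], with b0 the intercept."""
    design = np.column_stack([np.ones(len(Z)), Z])
    b_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b_hat


def predict_labels(Z, b_hat, cutoff=0.5):
    """Threshold the fitted linear response at the cutoff (assumed 0.5)."""
    design = np.column_stack([np.ones(len(Z)), Z])
    return (design @ b_hat >= cutoff).astype(int)


# Synthetic stand-in data (NOT the PIDD): 200 rows, 8 numeric features
# with decreasing variance so the leading components carry the signal.
rng = np.random.default_rng(0)
scales = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.8, 0.6, 0.4])
X = rng.normal(size=(200, 8)) * scales
noise = rng.normal(scale=0.5, size=200)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# 80/20 split, mirroring the 20% test share mentioned in the paper.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# The PCA basis is learned on the training rows only and reused for testing.
mean, components = pca_fit(X_train, n_components=4)
b_hat = fit_linear_model(pca_transform(X_train, mean, components), y_train)
y_pred = predict_labels(pca_transform(X_test, mean, components), b_hat)
accuracy = float((y_pred == y_test).mean())
print(f"test accuracy on synthetic data: {accuracy:.3f}")
```

Learning the PCA basis from the training rows alone keeps the test rows out of the fitted transformation; whether the paper follows the same order of operations is not stated in this excerpt.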
105314 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ VOLUME 7, 2019
H. Roopa, T. Asha: Linear Model Based on PCA for Disease Prediction
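\[
\begin{bmatrix}
\vdots & \vdots & & \vdots \\
1 & x_{1,768} & \cdots & x_{8,768}
\end{bmatrix}
\qquad
\hat{b} =
\begin{bmatrix}
\hat{b}_0 \\ \hat{b}_1 \\ \hat{b}_2 \\ \vdots \\ \hat{b}_k
\end{bmatrix}
\]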
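
For reference, if the coefficient vector is estimated by ordinary least squares (an assumption, since the estimation step does not survive in this excerpt), it has the standard closed form below. The symbols X (the design matrix whose first column is all ones) and y (the vector of observed outcomes) are introduced here for illustration and are not taken from the source.

\[
\hat{b} = (X^{\top} X)^{-1} X^{\top} y,
\qquad
\hat{y} = X \hat{b}
\]

The closed form requires X^{\top} X to be invertible. Principal components are mutually uncorrelated, so this condition holds for the projected attributes whenever each retained component has nonzero variance.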
the Figure 6 that near the 0.5 cutoff value, the TPR value is high. The calculated AUC value of the training
data is 0.8196.
From PIDD, 20% of the data is used for testing. The performance curve for the PIDD testing data is shown in
Figure 7. The performance curve of LRM is analyzed with respect to accuracy and the cutoff value used in
disease prediction. Figure 7 shows that at a cutoff value of 0.5, the LRM prediction accuracy rises above
80%. The LRM prediction accuracy falls when the cutoff value is below 0.5.
The ROC curve for the PIDD testing data is shown in Figure 8. In Figure 8 the prediction ROC curve is
distributed over various cutoff values ranging between

REFERENCES
[1] K. Polat, S. Güneş, and A. Arslan, "A cascade learning system for classification of diabetes disease: Generalized discriminant analysis and least square support vector machine," Expert Syst. Appl., vol. 34, no. 1, pp. 482–487, 2008.
[2] M. F. Ganji and M. S. Abadeh, "A fuzzy classification system based on ant colony optimization for diabetes disease diagnosis," Expert Syst. Appl., vol. 38, no. 12, pp. 14650–14659, 2011.
[3] M. Seera and C. P. Lim, "A hybrid intelligent system for medical data classification," Expert Syst. Appl., vol. 41, no. 5, pp. 2239–2249, 2014.
[4] S. Sa'di, A. Maleki, R. Hashemi, Z. Panbechi, and K. Chalabi, "Comparison of data mining algorithms in the diagnosis of type II diabetes," Int. J. Comput. Sci. Appl., vol. 5, no. 5, pp. 1–12, 2015.
[5] R. Bansal, S. Kumar, and A. Mahajan, "Diagnosis of diabetes mellitus using PSO and KNN classifier," in Proc. Int. Conf. Comput. Commun. Technol. Smart Nation (IC3TSN), Oct. 2017, pp. 32–38.
[6] D. K. Choubey, S. Paul, S. Kumar, and S. Kumar, "Classification of Pima Indian diabetes dataset using naive Bayes with genetic algorithm as an attribute selection," in Proc. Int. Conf. Commun. Comput. Syst. (ICCCS), Feb. 2017, pp. 451–455.
[7] H. Roopa and T. Asha, "Feature extraction of chest X-ray images and analysis using PCA and kPCA," Int. J. Elect. Comput. Eng., vol. 8, no. 5, p. 3392, Oct. 2018.
[8] H. Roopa and T. Asha, "Analysis of feature ranking methods on X-ray images," in Proc. Int. Conf. ISMAC Comput. Vis. Bio-Eng. Cham, Switzerland: Springer, 2018, pp. 1393–1403.
[9] UCI Repository of Machine Learning. Accessed: 2017. [Online]. Available: http://www.ics.uci.edu./~mlearn/ML Repository.html
[10] H. Roopa and T. Asha, "An efficient prediction model to analyze tuberculosis chest X-ray images," Int. J. Sci. Eng. Res., vol. 10, no. 1, pp. 694–700, 2019.
[11] D. Sisodia and D. S. Sisodia, "Prediction of diabetes using classification algorithms," Procedia Comput. Sci., vol. 132, pp. 1578–1585, Jan. 2018.
[12] A. Ahmad, A. Mustapha, E. D. Zahadi, N. Masah, and N. Y. Yahaya, "Comparison between neural networks against decision tree in improving prediction accuracy for diabetes mellitus," in Proc. Int. Conf. Digit. Inf. Process. Commun. Berlin, Germany: Springer, 2011, pp. 537–545.