
Received July 8, 2019, accepted July 23, 2019, date of publication July 30, 2019, date of current version August 14, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2931956

A Linear Model Based on Principal Component Analysis for Disease Prediction

H. ROOPA 1 AND T. ASHA 2
1 Department of Information Science and Engineering, Bangalore Institute of Technology, Bengaluru 560004, India
2 Department of Computer Science and Engineering, Bangalore Institute of Technology, Bengaluru 560004, India

Corresponding author: H. Roopa (roopatejas@gmail.com)

ABSTRACT Various classification methods are applied in the medical field to predict diseases such as diabetes and tuberculosis. Diabetes can be diagnosed by comparing a patient's blood sugar level with known normal levels, together with blood pressure, BMI, skin thickness, and other attributes. Several classification methods have already been applied to diabetes data. The main aim of this paper is to build a statistical model for diabetes data that yields better classification accuracy. The extracted features of the diabetes data are projected to a new space using principal component analysis and are then modeled by applying linear regression to the newly formed attributes. The method achieves an accuracy of 82.1% for predicting diabetes, which is an improvement over other existing classification methods.

INDEX TERMS Principal component analysis, linear regression model, diabetes.

I. INTRODUCTION
Analysis of diseases is a difficult task in the medical field. Diabetes [2] can be diagnosed by comparing the blood sugar level with the normal desired level. In this manuscript we present a statistical diabetes disease prediction model in which Principal Component Analysis (PCA) is applied to extract the attributes of the Pima Indian Diabetes Data (PIDD) into a new feature space. These attributes are then modeled using a linear regression model [8] to predict diabetes.

The attributes of PIDD are inspected from different angles to obtain the information required for processing the data, so feature extraction is a major step in examining PIDD. This work concentrates on projecting the feature values of PIDD into a new feature space by employing the PCA method. The new feature values are inspected for their importance and relevance, and are then subjected to data mining methods such as the LRM to classify the data and predict diabetes.

PCA is a reduction method that treats PIDD as a set of rows representing characteristics in a high-dimensional space and maps the rows onto the directions that best represent the features. For the original set of PIDD attributes, this transformation yields an axis, the principal eigenvector, along which the observations of each feature are most spread out; the variance of the data is maximized along this axis. When considering the second eigenvector, one observes the axis along which the variance of the distance from the first axis is greatest, and so on. A small number of eigenvectors then represents the matrix of points, and the data is approximated for the given number of columns so as to minimize the root mean square error. Thus the original features of PIDD are approximated with fewer dimensions that summarize the original PIDD.

Model building provides a good fit to a given set of data. A linear statistical model estimates the unknown dependent PIDD feature value from the known independent PIDD feature values. The representation of the relationship between the dependent PIDD feature and a set of independent PIDD features is known as regression analysis.

The rest of this work is organized as follows: Section 2 describes previous work on PIDD, Section 3 explains the proposed methodology, Section 4 discusses the findings, and Section 5 concludes the work.

The associate editor coordinating the review of this manuscript and approving it for publication was Yue Zhang.

II. RELATED WORK
Polat et al. [1] proposed a Least Square Support Vector Machine (LS-SVM) classification method that obtained an accuracy of 79.16%, improving over previous classification methods. Generalized Discriminant Analysis (GDA) was used at the preprocessing stage to discriminate the variables of PIDD, and the LS-SVM technique was then applied to these variables to classify the disease.

105314 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ VOLUME 7, 2019
H. Roopa, T. Asha: Linear Model Based on PCA for Disease Prediction

TABLE 1. The eight attributes of PIDD with the class variable.

FIGURE 1. Proposed steps involved in feature extraction and modeling of PIDD.

Seera and Lim [3] proposed a hybrid classification method combining a Fuzzy Min–Max neural network, a Regression Tree and a Random Forest (FMM-CART-RF) that achieved an accuracy of 78.39% on PIDD.

Sa'di et al. [4] classified PIDD using various data mining algorithms and found that Naïve Bayes, with an accuracy of 76.95%, performed better than the RBF network and J48.

Bansal et al. [5] proposed an evolutionary method in which feature selection on PIDD is performed by Particle Swarm Optimization (PSO) and a k-Nearest Neighbor (KNN) classifier is then applied to these features, achieving an accuracy of 77%.

Choubey et al. [6] applied a Genetic Algorithm (GA) for variable selection on PIDD and then implemented Naive Bayes (NB) on the selected variables to classify the diabetes disease, obtaining an accuracy of 78.69%.

III. PROPOSED METHODOLOGY
The proposed method extracts a new group of features from PIDD by employing PCA; the values are inspected for their importance and relevance and are then subjected to data mining methods such as the Linear Regression Model (LRM) to classify the data and predict diabetes. The model is illustrated in Figure 1.

A. INPUT DATA
The PIDD has been taken from the UCI machine learning repository [9]. The dataset contains female patients from Phoenix, Arizona, aged 21 years and above. The PIDD consists of 8 actual variables and 1 class variable. There are 768 instances in total, of which 268 instances have class value '1' and 500 instances have class value '0'. The patients are classified as diabetic or normal using this binary-valued variable: a response of '1' means ''positive for diabetes'' and '0' means ''negative for diabetes''. The attributes of PIDD are presented in Table 1.

B. FEATURE EXTRACTION OF PIDD BY APPLYING PCA [7]
Given the numeric data matrix containing the feature values of PIDD, principal component analysis is applied to project the features to a new space. The steps involved in feature extraction are:
1) From each dimension of PIDD subtract the mean, which produces a data set whose mean is zero.
2) Calculate the covariance matrix between every pair of dimensions of PIDD.
3) Find the eigenvectors and eigenvalues of the matrix obtained in step 2.
4) Construct the feature vector and take its transpose. Then multiply it with the original PIDD to obtain the new set of features projected to the new space.

C. LINEAR REGRESSION MODEL (LRM)
Regression analysis predicts the dependent variable value y based on the set of independent variable values x1, x2, x3, ..., xk. For multiple independent variables, the linear regression model [10] is represented by equation (1):

    y = b0 + b1 x1 + b2 x2 + ... + bk xk + ε    (1)

where y is the class variable of PIDD, b0 is the y-intercept, ε is an error term, and b1, b2, ..., bk are the coefficients of x1, x2, ..., xk respectively.

The dependent variable y of PIDD is the class variable, which is predicted using equation (1). y is based on the set of independent variable values x1, x2, ..., xk, represented by the values of comp 1, comp 2, comp 3, comp 4, comp 5, comp 6, comp 7 and comp 8 respectively.

The fitted line of equation (1) is calculated by the principle of least squares, where b0, b1, b2, ..., bk are chosen so that the Sum of Squared Errors (SSE) is minimized.
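As an illustration, the four extraction steps of Section III-B can be sketched in NumPy. The 768 × 8 matrix below is a random stand-in for the PIDD features, not the actual dataset:

```python
import numpy as np

# Sketch of the four PCA steps of Section III-B on a hypothetical
# 768 x 8 matrix (a random stand-in for the PIDD features).
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))

# Step 1: subtract the mean of each dimension -> zero-mean data.
Xc = X - X.mean(axis=0)

# Step 2: covariance matrix between every pair of dimensions (8 x 8).
C = np.cov(Xc, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix.
vals, vecs = np.linalg.eigh(C)          # eigenvalues in ascending order

# Step 4: order the eigenvectors by decreasing eigenvalue (the feature
# vector), then project the data onto them -> components comp 1..comp 8.
feature_vector = vecs[:, np.argsort(vals)[::-1]]
components = Xc @ feature_vector        # 768 x 8 matrix of new features
```

`np.linalg.eigh` is used because a covariance matrix is symmetric; the variance of each projected component then equals the corresponding (sorted) eigenvalue, matching the variance-maximizing description in the Introduction.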
Consider equation (1) for n = 768 observations. Let

    Y = [y1, y2, y3, ..., y768]^T

    X = | 1  x1,1    x2,1   ...  x8,1   |
        | :  :       :           :      |
        | 1  x1,768  x2,768 ...  x8,768 |

    b̂ = [b̂0, b̂1, b̂2, ..., b̂k]^T

where b̂ holds the least squares estimates of b0, b1, ..., bk of the linear model. The least squares (normal) equations are

    (X'X) b̂ = X'Y

where X' is the transpose of the matrix X, (X'X) is the coefficient matrix of the least squares estimates b̂0, b̂1, ..., b̂k, and X'Y is the matrix of constants. Therefore the least squares solution is

    b̂ = (X'X)^(-1) X'Y.

Substituting b̂ in equation (1) gives the final fitted line.

D. OUTPUT DATA
The model classifies the features of PIDD as diabetic or normal.

IV. RESULTS
The attributes of PIDD are projected to a new space using PCA, and the LRM is then applied to the components of PIDD. The results of the model are presented in Figure 2.

FIGURE 2. LRM results of PIDD.

The residual errors between the predictions of the LRM and the actual results are small, ranging between −1.0059 and 1.2930, and the median value is close to zero. The F-statistic shows that the model has at least one variable that is significantly different from zero. The variables comp1, comp2, comp5, comp6, comp7 and comp8, marked with ∗∗∗ and ∗∗, are more significant than comp3 and comp4, which are less significant.

The components of PIDD are partitioned, with 80% of the data used for training the model and 20% for testing it. These data are classified using the LRM. The actual and predicted values of PIDD modeled by the LRM are given by confusion matrices; the confusion matrices and statistics of the training and testing data of PIDD are given in Figure 3 and Figure 4 respectively.

FIGURE 3. Performance of training data.

The confusion matrix of the training data in Figure 3 shows that 614 of the 768 instances of PIDD are used for training. 351 observations are correctly predicted negative for diabetes and 110 observations are correctly predicted positive, so 461 observations are correctly classified by the LRM. 46 observations are incorrect predictions in which a normal person is predicted positive, and 99 observations are incorrect predictions in which a diabetic person is predicted negative, so 145 observations are wrongly classified by the LRM.

The confusion matrix of the testing data in Figure 4 shows that 154 of the 768 instances of PIDD are used for testing. 96 observations are correctly predicted negative for diabetes and 37 observations are correctly predicted positive, so 133 observations are correctly classified by the LRM. 7 observations are incorrect predictions in which a normal person is predicted positive, and 22 observations are incorrect predictions in which a diabetic person is predicted negative, so 29 observations are wrongly classified by the LRM.
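The normal-equations solution b̂ = (X'X)^(-1) X'Y and the confusion-matrix counts described above can be sketched as follows. The data here is synthetic and purely illustrative, not the actual PIDD fit:

```python
import numpy as np

# Sketch of the least squares fit b_hat = (X'X)^-1 X'Y and the 0.5-cutoff
# classification into confusion-matrix counts (synthetic stand-in data).
rng = np.random.default_rng(0)
n, k = 768, 8
comps = rng.normal(size=(n, k))              # stand-in for comp 1..comp 8
y = (rng.random(n) < 0.35).astype(float)     # 1 = diabetic, 0 = normal

X = np.hstack([np.ones((n, 1)), comps])      # prepend the intercept column
b_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations

pred = (X @ b_hat >= 0.5).astype(int)        # cutoff at 0.5
tn = int(np.sum((pred == 0) & (y == 0)))     # correctly predicted normal
tp = int(np.sum((pred == 1) & (y == 1)))     # correctly predicted diabetic
fp = int(np.sum((pred == 1) & (y == 0)))     # normal predicted positive
fn = int(np.sum((pred == 0) & (y == 1)))     # diabetic predicted negative
accuracy = (tn + tp) / n
```

With the real PCA components and class labels in place of the random stand-ins, the four counts would populate confusion matrices of the kind shown in Figures 3 and 4.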


FIGURE 5. Training data’s Performance curve.

FIGURE 4. Performance of testing data.

TABLE 2. The performance evaluation of training and testing data using PCA-LRM on PIDD.

FIGURE 6. Training data’s ROC curve.
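The performance and ROC curves behind figures such as Figures 5 and 6 are built by sweeping the cutoff value and recording the accuracy, true positive rate and false positive rate at each setting. A sketch with synthetic scores (illustrative only, not the paper's fitted values):

```python
import numpy as np

# Sweep cutoff values over synthetic fitted scores, recording accuracy
# (performance curve) and TPR/FPR (ROC curve) at each cutoff.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)                    # 1 = diabetic
scores = 0.3 * y_true + rng.normal(0.35, 0.2, size=200)  # stand-in fitted values

def rates(cutoff):
    pred = (scores >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)                  # sensitivity
    fpr = fp / (fp + tn)                  # 1 - specificity
    acc = (tp + tn) / len(y_true)
    return tpr, fpr, acc

cutoffs = np.linspace(scores.min(), scores.max(), 101)
tprs, fprs, accs = np.array([rates(c) for c in cutoffs]).T

# AUC by the trapezoidal rule, with FPR sorted into ascending order.
fp_asc, tp_asc = fprs[::-1], tprs[::-1]
auc = float(np.sum(np.diff(fp_asc) * (tp_asc[1:] + tp_asc[:-1]) / 2.0))
```

Plotting `accs` against `cutoffs` gives a performance curve, and `tprs` against `fprs` gives an ROC curve of the kind reported for the training and testing data.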

The performance of the LRM can be analyzed by its accuracy in predicting the disease. The implementation of the LRM on PIDD uses probability to predict whether a patient is diabetic or nondiabetic. The default probability value of 0.5 is used as the cutoff: if the predicted value falls below 0.5 the LRM prediction is negative for diabetes, and if it is above 0.5 the LRM predicts the patient to be diabetic.

The overall performance of the model on PIDD is presented in Table 2.

From PIDD, 80% of the data is used for training. The performance curve of the training data of PIDD is shown in Figure 5. The performance curve of the LRM is analyzed with respect to accuracy and the cutoff value used in disease prediction. Figure 5 shows that at the cutoff value 0.5 the LRM prediction accuracy rises above 70%, and that the accuracy falls when the cutoff value is below 0.5.

The Receiver Operating Characteristic (ROC) curve of the LRM demonstrates its classification ability as the cutoff value varies. The ROC curve is constructed at various cutoff values by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR); the TPR is called the Sensitivity and the calculated FPR is known as (1 − Specificity). The ROC curve is therefore used to analyze the optimality of the LRM. It is interpreted by calculating the Area Under the Curve (AUC); the overall value of the ROC curve, a cumulative probability distribution, is 1. A diagonal line is drawn and used as a partition to analyze the ROC curve: if the ROC curve lies above this line the LRM prediction is better than 50%, whereas if it lies below the line the prediction is worse than 50%, which is not good for a model.

The ROC curve of the training data of PIDD is shown in Figure 6, where the prediction ROC curve is distributed over cutoff values ranging between −0.53 and 0.82. The LRM prediction is better than 50%, as the curve lies above the diagonal. Figure 6 also shows that near the 0.5 cutoff value the TPR is high. The calculated AUC value of the training data is 0.8196.

From PIDD, 20% of the data is used for testing. The performance curve of the testing data of PIDD is shown in Figure 7 and is likewise analyzed with respect to accuracy and the cutoff value. Figure 7 shows that at the cutoff value 0.5 the LRM prediction accuracy rises above 80%, and that the accuracy falls when the cutoff value is below 0.5.

FIGURE 7. Testing data's performance curve.

The ROC curve of the testing data of PIDD is shown in Figure 8, where the prediction ROC curve is distributed over cutoff values ranging between −0.19 and 1.01. The LRM prediction is better than 50%, as the curve lies above the diagonal. Figure 8 also shows that near the 0.5 cutoff value the TPR is high. The calculated AUC value of the testing data is 0.8487.

FIGURE 8. Testing data's ROC curve.

Various existing methods have been implemented on PIDD; a few are listed in Table 3, which gives the performance measure of each method. Table 3 shows that PCA-LRM achieves better accuracy on PIDD than the other prevailing methods.

TABLE 3. Comparison of outcomes of PIDD with several existing methods.

V. CONCLUSION AND FUTURE WORK
Feature extraction and statistical modeling on PIDD are presented in this work. The PIDD features are extracted to a new space using PCA, and the newly projected features are then modeled using the LRM to predict whether a patient is diabetic or normal. The results obtained in this study achieve a high accuracy rate for predicting diabetes compared with other existing methods. The proposed statistical model can be adopted for predicting other diseases such as tuberculosis, eye disease and cancers.

REFERENCES
[1] K. Polat, S. Güneş, and A. Arslan, ''A cascade learning system for classification of diabetes disease: Generalized discriminant analysis and least square support vector machine,'' Expert Syst. Appl., vol. 34, no. 1, pp. 482–487, 2008.
[2] M. F. Ganji and M. S. Abadeh, ''A fuzzy classification system based on ant colony optimization for diabetes disease diagnosis,'' Expert Syst. Appl., vol. 38, no. 12, pp. 14650–14659, 2011.
[3] M. Seera and C. P. Lim, ''A hybrid intelligent system for medical data classification,'' Expert Syst. Appl., vol. 41, no. 5, pp. 2239–2249, 2014.
[4] S. Sa'di, A. Maleki, R. Hashemi, Z. Panbechi, and K. Chalabi, ''Comparison of data mining algorithms in the diagnosis of type II diabetes,'' Int. J. Comput. Sci. Appl., vol. 5, no. 5, pp. 1–12, 2015.
[5] R. Bansal, S. Kumar, and A. Mahajan, ''Diagnosis of diabetes mellitus using PSO and KNN classifier,'' in Proc. Int. Conf. Comput. Commun. Technol. Smart Nation (IC3TSN), Oct. 2017, pp. 32–38.
[6] D. K. Choubey, S. Paul, S. Kumar, and S. Kumar, ''Classification of Pima indian diabetes dataset using naive Bayes with genetic algorithm as an attribute selection,'' in Proc. Int. Conf. Commun. Comput. Syst. (ICCCS), Feb. 2017, pp. 451–455.
[7] H. Roopa and T. Asha, ''Feature extraction of chest X-ray images and analysis using PCA and kPCA,'' Int. J. Elect. Comput. Eng., vol. 8, no. 5, p. 3392, Oct. 2018.
[8] H. Roopa and T. Asha, ''Analysis of feature ranking methods on X-ray images,'' in Proc. Int. Conf. ISMAC Comput. Vis. Bio-Eng. Cham, Switzerland: Springer, 2018, pp. 1393–1403.
[9] UCI Repository of Machine Learning. Accessed: 2017. [Online]. Available: http://www.ics.uci.edu./~mlearn/ML Repository.html
[10] H. Roopa and T. Asha, ''An efficient prediction model to analyze tuberculosis chest X-ray images,'' Int. J. Sci. Eng. Res., vol. 10, no. 1, pp. 694–700, 2019.
[11] D. Sisodia and D. S. Sisodia, ''Prediction of diabetes using classification algorithms,'' Procedia Comput. Sci., vol. 132, pp. 1578–1585, Jan. 2018.
[12] A. Ahmad, A. Mustapha, E. D. Zahadi, N. Masah, and N. Y. Yahaya, ''Comparison between neural networks against decision tree in improving prediction accuracy for diabetes mellitus,'' in Proc. Int. Conf. Digit. Inf. Process. Commun. Berlin, Germany: Springer, 2011, pp. 537–545.
