Diabetics Prediction by Using Feature Selection Based on Coefficient of Variation
Simon Fong (1), Justin Liang (1), Suash Deb (2)
(1) Department of Computer Science, University of Macau, Taipa, Macau SAR
(2) Department of Computer Science and Engineering, Cambridge Institute of Technology, Ranchi, India
Abstract
Diabetes has become a prevalent metabolic disease, affecting patients of all age groups and large populations around the world. Early detection facilitates early treatment, which improves the prognosis. In the computational intelligence and medical care literature, different techniques have been proposed for predicting diabetes based on historical records of related symptoms. These efforts share a common goal of improving the accuracy of diabetes prediction models. In addition to the model induction algorithm, feature selection is a significant step: it retains only the relevant attributes for building a quality prediction model. In this paper, a novel and simple feature selection criterion called Coefficient of Variation (CV) is proposed as a filter-based feature selection scheme. Under the CV method, attributes whose data dispersion is too low are disqualified from the model construction process, thereby discarding attributes that would otherwise lead to poor model accuracy. The computation of CV is simple, enabling an efficient feature selection process. Computer simulation experiments on the Pima Indian diabetes dataset are used to compare the performance of CV with other traditional feature selection methods. Superior results by CV are observed.
2013 Elsevier Science. All rights reserved.
Keywords: Diabetes prediction; Classification; Data Mining; Feature Selection.
1. Introduction
Diabetes is a global health concern in both developed and developing countries, and its prevalence is rising. In the UK alone, 2.9 million people were suffering from diabetes mellitus in 2011, constituting 4.45% of the population [1]. By 2025, 5 million people in the UK are projected to be afflicted with diabetes. This incurable metabolic disorder is chronic and characterized by deficiency of insulin secretion or insensitivity of the body tissues to insulin. The former is known as Type-I insulin-dependent diabetes mellitus (IDDM), in which the body fails to produce sufficient insulin due to autoimmune destruction of pancreatic beta-cells. As a result, the patients' body cells may wither because, without this important hormone, they cannot absorb the needed amount of glucose from the bloodstream. The second type, Type-II non-insulin-dependent diabetes mellitus, is usually associated with obesity and lack of bodily exercise. It inevitably leads to insulin treatment, probably life-long. Early detection of diabetes has become vital, and detection techniques have been maturing over the years. However, it is reported that about half of the patients with Type-II diabetes are undiagnosed, and the latency from disease onset to diagnosis may exceed a decade [2, 3]. Therefore, the importance of early prediction and detection of diabetes, which enables timely treatment of hyperglycaemia and related metabolic abnormalities, is escalating.
In light of this motivation, diabetes prediction models are being formulated and developed in the machine-learning research community, with claims of being able to predict blood glucose based on the historical records of diabetes patients and their relevant attributes. One of the most significant works is by Jan Maciejovski [4], who formulated predictive diabetic control by using a group of linear and non-linear programming functions that take variables and constraints into consideration. Another direction related to blood glucose prediction is time-series forecasting [5], which takes into account the measurements of past blood glucose cycles in order
[Proceedings of International Conference on Computing Sciences, WILKES100 ICCS 2013. ISBN: 978-93-5107-172-3. Elsevier Publications, 2013, p. 265.]
to do some short-term blood glucose forecasting. Another popular choice of algorithm for implementing a blood glucose predictor is the artificial neural network [6, 7, 8], which non-linearly maps daily regimens of food, insulin and exercise expenditure as inputs to a predicted output. Although neural network predictors can usually achieve relatively high accuracy (88.8% as in [8]), the model itself is a black box: the decision-making logic is mathematical inference, for example the numeric weights associated with each neuron and the non-linear activation function. Recently some researchers have advocated the applicability of decision trees in predicting diabetic prognosis, such as the batch-training model [9] and the real-time incremental training model [10]. The resultant decision tree takes the form of IF-THEN-ELSE predicate rules, which are descriptive enough for decision support when embedded in a predictor system, as well as for reference and study by clinicians. However, one major drawback of decision trees is the selection of appropriate data attributes or features, which should be general enough to model the historical cases while providing sufficiently high prediction accuracy on unseen cases.
Potentially there exist many factors (so-called features) for the analysis and diagnosis of diabetes; these factors may be direct physiological symptoms or lifestyle habits that contribute to the disease. However, there is no standard rule-of-thumb for deciding which of these factors to include in model induction [11], given that different physicians may have their own opinions. When, for convenience, all the available features are included in model construction, quite often some of them turn out to be insignificant or irrelevant. Consequently the accuracy of the prediction model drops, because the inappropriate features may add randomness to the data or their values may lead to biased results. Although the topic of feature selection has been widely studied, to the best of the authors' knowledge a comprehensive evaluation of feature selection methods pertaining to neural network and decision tree classification has not been done so far. Existing research works focus either on a single classification model, especially the support vector machine (SVM), or on a few feature selection techniques. For instance, the research teams of [12, 13, 14] dedicated their efforts to solving feature selection with an SVM classifier and its variants. Huang et al. [15] studied the diabetes prediction problem with a variety of classifiers, such as the CART decision tree, but only a single feature selection method, ReliefF, was used. In this paper, we propose a novel feature selection method called Coefficient of Variation (CV), which is characterized by simple and efficient computation. In comparison to other popular feature selection methods, which are based on calculating the information gain of, or the correlation among, attributes and the target classes, CV only calculates the ratio of the standard deviation to the mean of each attribute column. The underlying principle is that a good attribute should have data that vary considerably in value and spread adequately over a certain range, in order to characterize a quality prediction model. Conversely, an attribute whose data values diverge insufficiently implies that certain bias may exist in the data; at the least, such an attribute contributes little to the generality of the induced model because of the relatively narrow data range it covers. In the context of stochastic optimization, a model induced from such data would likely fall into a local optimum.
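The CV filtering principle just described can be sketched in a few lines. This is a minimal illustration, not the authors' Weka plug-in, and the cut-off threshold of 0.1 is an arbitrary value chosen for the toy data:

```python
import numpy as np

def cv_scores(X):
    """Coefficient of variation (std / mean) of each attribute column."""
    return X.std(axis=0) / X.mean(axis=0)

def select_by_cv(X, threshold):
    """Disqualify attributes whose data dispersion (CV) is too low."""
    keep = cv_scores(X) > threshold
    return X[:, keep], keep

# Toy data: column 0 varies widely, column 1 barely varies.
X = np.array([[1.0, 10.0],
              [5.0, 10.1],
              [9.0, 10.2]])
X_sel, mask = select_by_cv(X, threshold=0.1)  # only column 0 survives
```

A real pipeline would apply the surviving mask to the training set before model induction.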
The remainder of this paper is structured as follows. Section 2 describes the diabetes dataset and the related data mining techniques to be applied in the experiment, which together constitute the data mining framework. Section 3 reports the results of the experiment, followed by some discussion. Section 4 concludes the paper.
2. Data Mining Framework
This section describes the data used in the experiment, the feature selection algorithms and the model induction algorithms. Although this pipeline is commonly known as knowledge discovery in databases (KDD), the details, especially the attributes, are presented explicitly so that the efficacy of the CV feature selection method that follows can be appreciated.
2.1. Diabetes Dataset
One of the most popularly used datasets for testing machine learning algorithms for diabetes prediction is the Pima Indian diabetes dataset [16]. The dataset is generally challenging for building a highly accurate model because none of the attributes has a profound relation to the predictable class, though these attributes are believed to be subtly related to the disease. The dataset has eight potentially useful attributes or features that describe 768 sample cases, each labelled as diabetic or not. The binary class comprises 500 normal cases and 268 abnormal cases (diagnosed with Pima Indian diabetes), a moderately balanced distribution of 65.10% : 34.90%. The diabetes cases are those diagnosed with diabetes onset within five years. The data are owned by Peter Turney, National Institute of Diabetes and
Digestive and Kidney Diseases, and the database was donated by Vincent Sigillito, Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, United States. The eight features are briefly described as follows, together with their acronyms.
Feature 1 preg: Number of times pregnant
Feature 2 plas: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Feature 3 pres: Diastolic blood pressure (mm Hg)
Feature 4 skin: Triceps skin fold thickness (mm)
Feature 5 insu: 2-Hour serum insulin (mu U/ml)
Feature 6 mass: Body mass index (weight in kg/(height in m)^2)
Feature 7 pedi: Diabetes pedigree function
Feature 8 age: Age (in years)
Class class: binary class variable (0 or 1)
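For readers reproducing the experiment outside Weka, the eight attributes and the class label can be read from the commonly distributed comma-separated form of the dataset. The file name used here is an assumption for illustration:

```python
import csv

# Acronyms of the eight features, in the column order listed above.
FEATURES = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"]

def load_pima(path="pima-indians-diabetes.csv"):
    """Return the feature rows and 0/1 class labels of the 768 cases."""
    X, y = [], []
    with open(path) as f:
        for row in csv.reader(f):
            X.append([float(v) for v in row[:8]])  # eight numeric features
            y.append(int(row[8]))                  # class: 0 normal, 1 diabetic
    return X, y
```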
2.2. Feature Selection
The dataset has not been pre-processed, cleansed or filtered in our experiment; all 768 instances are used as they originally are in the model induction. Nevertheless, feature selection is applied prior to model induction. The standard feature selection algorithms are those made available by Weka [17], a popular software platform of machine learning algorithms for solving data mining problems, implemented in Java and open-sourced under the GPL license by the University of Waikato, New Zealand. Most of these algorithms are documented in [18], and their implementations are provided by Weka as the built-in functions listed below. Readers who want further information about these algorithms can refer to the user's manual of Weka [17] or the survey [18].
CfsSubsetEval: Evaluates the worth of a subset of attributes by considering the individual predictive ability of
each feature along with the degree of redundancy between them.
ChiSquaredAttributeEval: Evaluates the worth of an attribute by computing the value of the chi-squared
statistic with respect to the class.
InfoGainAttributeEval: Evaluates the worth of an attribute by measuring the information gain with respect to
the class.
PrincipalComponents: Performs a principal components analysis and transformation of the data.
SignificanceAttributeEval: Evaluates the worth of an attribute by computing the Probabilistic Significance as a
two-way function.
CorrelationAttributeEval: Evaluates the worth of an attribute by measuring the correlation (Pearson's) between
it and the class.
SVMAttributeEval: Evaluates the worth of an attribute by using an SVM classifier.
ReliefFAttributeEval: Evaluates the worth of an attribute by repeatedly sampling an instance and considering
the value of the given attribute for the nearest instance of the same and different class.
SymmetricalUncertAttributeEval: Evaluates the worth of an attribute by measuring the symmetrical
uncertainty with respect to the class.
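As a point of comparison, the correlation-based evaluator above (CorrelationAttributeEval) can be mimicked outside Weka by ranking attributes on their absolute Pearson correlation with the class. This is an illustrative re-implementation, not Weka's code:

```python
import numpy as np

def pearson_scores(X, y):
    """Absolute Pearson correlation of each attribute column with the class."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    num = (Xc * yc[:, None]).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

# Toy example: column 0 tracks the class, column 1 is noise.
X = np.array([[1.0, 0.9], [2.0, 0.1], [3.0, 0.8], [4.0, 0.2]])
y = np.array([0.0, 0.0, 1.0, 1.0])
scores = pearson_scores(X, y)
ranking = np.argsort(scores)[::-1]  # best-ranked attribute first
```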
A new feature selection method based on the Coefficient of Variation, called CVAttributeEval, is proposed in this paper. The method is programmed in the Java language as a Weka extension plug-in. It is founded on the belief that a good attribute in a training dataset should have data that vary sufficiently widely across a range of values, so that it is significant in characterizing a useful prediction model. To illustrate this concept visually, the eight attributes of the Pima Indian diabetes dataset are plotted using the Projection Plot in Weka. Since the original attribute values come in various unit scales, e.g. Age or Body mass index is usually a two-digit number while the Diabetes pedigree function has at least three digits, the attributes are first normalized into the fixed range [0, 1]. In the visualization, the attribute values are displayed in seven different scales of color tones, indicating the ranges of values into which the data fall. The extreme values are in solid red and blue, the central values are pale, and those in between are in mild colors. The color chart is shown in Figure 1. The outputs for the eight attributes are shown in Figures 2 (a-h).
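The normalization and seven-tone binning described above can be sketched as follows. The exact mapping of bins to the red-pale-blue chart is an assumption, since the paper does not spell out the bin boundaries:

```python
import numpy as np

def minmax_normalize(col):
    """Rescale an attribute column into the fixed range [0, 1]."""
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo)

def color_bins(col, n_bins=7):
    """Assign each value to one of seven tone bins (0 and 6 are the extremes)."""
    norm = minmax_normalize(col)
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

# Example on a small column of ages.
age = np.array([21.0, 30.0, 45.0, 60.0, 81.0])
bins = color_bins(age)
```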
Fig. 1. Color chart for displaying the values of attribute data in different hues.
Fig. 2(a). Visualization of data values of attribute preg. Fig. 2(b). Visualization of data values of attribute plas.
Fig. 2(c). Visualization of data values of attribute pres. Fig. 2(d). Visualization of data values of attribute skin.
Fig. 2(e). Visualization of data values of attribute insu. Fig. 2(f). Visualization of data values of attribute mass.
Fig. 2(g). Visualization of data values of attribute pedi. Fig. 2(h). Visualization of data values of attribute age.
(Four of the panels in Fig. 2 are annotated "Low data dispersion".)
As can be observed from the collection of visualizations, the attributes do have a good distribution over the data space, except for plas, pres, mass and age, which are shown in Figures 2 (b), (c), (f) and (h) respectively. Attributes that do not have a far spread in the data space often also carry mediocre data ranges, represented by white-to-grey data points. That is, the data of these attributes do not vary much on the data scale; hence these attributes may not contribute significantly to an accurate prediction model. These visual patterns can be quantified by the CV computation.
Let X be a training dataset with n instances (vectors) whose values are characterized by a total of m attributes or features. An instance is an m-dimensional tuple of the form (x_1, x_2, ..., x_m). Each attribute x_a, where a ∈ [1..m], can be partitioned into subgroups of different classes c ∈ C, where |C| is the total number of prediction target classes.
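Given this notation, with x_{i,a} denoting the value of attribute a in instance i, the CV criterion itself, described in the abstract as the ratio of the standard deviation to the mean of each attribute column, can be written per attribute. The formula below is reconstructed from that textual description, since the original derivation is truncated here:

```latex
\mathrm{CV}_a \;=\; \frac{\sigma_a}{\mu_a},
\qquad
\mu_a \;=\; \frac{1}{n}\sum_{i=1}^{n} x_{i,a},
\qquad
\sigma_a \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_{i,a}-\mu_a\bigr)^{2}}
```

Attributes whose CV_a falls below a chosen cut-off are then disqualified from model induction.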