
Universidade de São Paulo
Biblioteca Digital da Produção Intelectual - BDPI

Departamento de Ciências de Computação - ICMC/SCC
Artigos e Materiais de Revistas Científicas - ICMC/SCC

2015-07

Effect of label noise in the complexity of classification problems
Neurocomputing, Amsterdam, v. 160, p. 108-119, Jul. 2015
http://www.producao.usp.br/handle/BDPI/50947

Downloaded from: Biblioteca Digital da Produção Intelectual - BDPI, Universidade de São Paulo

Neurocomputing 160 (2015) 108-119


Effect of label noise in the complexity of classification problems


Luís P.F. Garcia a,*, André C.P.L.F. de Carvalho a, Ana C. Lorena b

a Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Trabalhador São-carlense Av. 400, São Carlos, São Paulo 13560-970, Brazil
b Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo, Talim St. 330, São José dos Campos, São Paulo 12231-280, Brazil

ARTICLE INFO

Article history:
Received 11 March 2014
Received in revised form 8 October 2014
Accepted 11 October 2014
Available online 11 February 2015

ABSTRACT

Noisy data are common in real-world problems and may have several causes, like inaccuracies, distortions or contamination during data collection, storage and/or transmission. The presence of noise in data can affect the complexity of classification problems, making the discrimination of objects from different classes more difficult and requiring more complex decision boundaries for data separation. In this paper, we investigate how noise affects the complexity of classification problems by monitoring the sensitivity of several indices of data complexity in the presence of different label noise levels. To characterize the complexity of a classification dataset, we use geometric, statistical and structural measures extracted from the data. The experimental results show that some measures are more sensitive than others to the addition of noise in a dataset. These measures can be used in the development of new preprocessing techniques for noise identification and of novel label noise tolerant algorithms. We thereby show preliminary results on a new filter for noise identification, which is based on two of the complexity measures that were most sensitive to the presence of label noise.

© 2015 Elsevier B.V. All rights reserved.

Keywords:
Classification
Label noise
Complexity measures
Noise filter

1. Introduction

Noisy data can be regarded as objects that present inconsistencies in their predictive and/or target attribute values [1]. These inconsistencies can be errors, absent information or unknown values [2]. Whereas noise needs to be identified and treated, secure data in a dataset must be preserved [3]. The term secure data usually refers to instances that form the core of the knowledge necessary to build accurate learning models.

There are various strategies and techniques in the literature to handle noisy data [4-8,3,9]. Generally, the identified noisy data are filtered and removed from the datasets. Nonetheless, it is usually hard to determine whether a given instance is indeed noisy or not. Some recent proposals include designing classification techniques that are more tolerant and robust to noise, as surveyed in [10]. Whatever the strategy employed to deal with noisy data, either data cleansing or the design of noise-tolerant learning algorithms, it is important to understand the effect of noise on the complexity of classification problems. This analysis can support the development of new techniques for dealing with noisy data more effectively.

This paper experimentally investigates the effect of distinct levels of label noise on the complexity of classification problems.
* Corresponding author. Tel.: +55 16 3373 8161; fax: +55 16 3373 9633.
E-mail addresses: lpfgarcia@gmail.com, lpgarcia@icmc.usp.br (L.P.F. Garcia), andre@icmc.usp.br (A.C.P.L.F. de Carvalho), aclorena@unifesp.br (A.C. Lorena).

http://dx.doi.org/10.1016/j.neucom.2014.10.085
0925-2312/© 2015 Elsevier B.V. All rights reserved.

For such, we employ a series of statistical and geometric descriptors described in [11]. These indices account for the difficulty of a classification problem by analyzing some characteristics of the dataset and the predictive performance of some simple classification models induced using the dataset. The descriptors are grouped into three main categories: measures of overlapping between feature values, measures of class separability and measures of data geometry, topology and density. We also employ some structural indices, captured by representing the dataset through a graph structure [12]. These measures extract topological and structural properties from the graphs built [13,14].

Throughout the experiments, we identify the indices that are most sensitive to the presence of label noise and can thereby be used to support noise identification. Afterwards, we preliminarily propose a simple preprocessing technique for label noise identification, which uses two of the measures that were most sensitive to the presence of label noise. Nonetheless, we emphasize that the studies carried out can also support the development of other data cleaning techniques and of novel noise-tolerant learning algorithms.
The main contributions of this study can be summarized as:

- Show that the presence of label noise at different levels influences the complexity of a classification problem. This is performed by monitoring a group of measures able to characterize the complexity of a classification problem from different perspectives;
- Analyze a new set of features which characterizes the complexity of the problem by modeling a classification dataset through a graph structure. These measures consider distinct topological properties of the graph built from the underlying classification dataset;
- Highlight the measures that are most sensitive to label noise imputation and use some of them to propose a new preprocessing technique able to identify label noise in a dataset.

This paper is structured as follows. Section 2 defines label noise. Section 3 describes the measures employed in this paper to characterize the complexity of the noisy classification problems, while Section 4 presents the experimental methodology followed to evaluate the sensitivity of the complexity measures to label noise imputation. Section 5 presents and discusses the results achieved in this analysis. Section 6 describes a new filter for noise identification, based on a subset of the measures previously investigated, and presents experimental results related to the technique's performance. Finally, Section 7 concludes this paper.

2. Label noise
In supervised learning, a learning algorithm is applied to a dataset containing $n$ pairs $(\vec{x}_i, y_i)$, where each $\vec{x}_i$ is a data point described by $m$ predictive features and $y_i$ corresponds to the expected label of $\vec{x}_i$. In classification problems, this label corresponds to a class or a category. The learning algorithm then induces a classification model, which should be able to predict the label of new data points. The presence of noise in a dataset affects its quality and may impair the predictive performance of the classifiers induced.
When dealing with supervised classification problems, noise can be found in [15]: (a) predictive features and (b) the target attribute. Noise in predictive features is introduced in one or more predictive attributes as a consequence of incorrect, absent or unknown values. Noise in the target attribute, on the other hand, occurs in the class labels. It can be caused by errors or subjectivity in data labeling, as well as by the use of inadequate information in the labeling process. Lastly, noise in predictive features can lead to a wrong labeling of the data points, since they can be moved to the wrong side of the decision frontier.

The majority of the existing Machine Learning (ML) algorithms for solving classification problems try to minimize a cost function based on misclassification, assuming the target label values in the dataset to be correct. Due to the importance of the class information in supervised learning, in this paper we deal with noise in the target attribute only.
Learning in noisy environments has been largely investigated in recent decades [10]. Some authors prefer to deal with the problem while the classification model is induced, providing more robustness to the induced models. The pruning process in Decision Tree (DT) induction algorithms is an early initiative to increase their robustness to noisy data [16]. Nonetheless, if the noise level is high, the definition of the pruning degree can be challenging and can ultimately remove branches that are based on safe information too. Another example is the use of slack variables in Support Vector Machines training [17], which allow some examples to be misclassified or to lie within the margins of separation between the classes. This introduces an additional parameter to be tuned during the SVM training: the regularization constant. This constant accounts for the amount of training examples that can be misclassified or placed near the induced decision boundary. Recent work addresses noise-tolerant classifiers, where a label noise model is learnt jointly with the classification model itself [18]. For such, typically, some information must be available about the label noise or its effects [10]. The learning algorithm can also be modified to embed data cleansing [19]. Other authors prefer to treat noise beforehand, in a preprocessing step. Filters are developed for such, which scan the dataset for unreliable data [7,8,3]. The preprocessed dataset can then be used as input to any classification algorithm.

One of the contributions of this paper is to experimentally show how label noise affects the complexity of various classification datasets. Based on these results, we also present a new noise filtering technique. This evidences the importance of understanding the influence of label noise in classification problems. Furthermore, these results may contribute to the proposal of other filters and of novel noise-tolerant algorithms.

3. Complexity indices for describing data

Each noise-tolerant technique and cleansing filter has a distinct bias when dealing with noise. To better understand their particularities, it is important to know how noisy data affect a classification problem. According to [20], noisy data tend to increase the complexity of the classification problem. Therefore, the identification and removal of noise can simplify the geometry of the separation border between the problem classes [21].

Singh [22] proposes a technique that estimates the complexity of the classification problem using neighborhood information for the identification of outliers. Sáez et al. [23] use measures able to characterize the complexity of the classification problem to predict when a noise filter can be effectively applied to a dataset. Smith et al. [9] propose a measure to capture instance hardness, considering an instance as hard if it is misclassified by a diverse set of classification algorithms. The proposed instance hardness measure is afterwards included in the learning process in two ways. They first propose a modification of the error function minimized during neural network training, so that hard instances have a lower weight on the error function update. The second proposal is a filter that removes hard instances, which correspond to potential noisy data. In [24], we used data complexity measures as input to classifiers which were able to successfully predict whether a given dataset contains noise or not. All of these previous works confirm the effect of noise on the complexity of the classification problem.
In this paper we evaluate in depth the effects of different noise levels on the complexity of classification problems, by extracting different measures from the datasets and monitoring their sensitivity to noise imputation. According to Ho and Basu [11], the difficulty of a classification problem can be attributed to three main aspects: the ambiguity among the classes, the complexity of the separation between the classes, and the data sparsity and dimensionality. Usually, there is a combination of these aspects. They propose a set of geometrical and statistical descriptors able to characterize the complexity of the classification problem associated with a dataset. Originally proposed for binary classification problems [11], some of these measures were later extended to multiclass classification in [25,26]. For measures only suitable for binary classification problems, we first transform the multiclass problem into a set of binary classification subproblems by using the one-vs-all approach. The mean of the complexity values obtained in such subproblems is then used as an overall measure for the multiclass dataset.
The descriptors of Ho and Basu [11] can be divided into three categories:

Measures of overlapping in the feature values: Assess the separability of the classes in a dataset. The discriminant power of each feature reflects its ambiguity level compared to the other features.

Measures of class separability: Quantify the complexity of the decision boundaries separating the classes. They are usually based on linearity and on the distance between examples.

Measures of geometry and topology: Extract features from the local (geometry) and global (topology) structure of the data to describe the separation between classes and the data distribution.

Additionally, we characterize the classification dataset as a graph and extract some structural measures from it. Modeling a classification dataset through a graph allows capturing additional topological and structural information from a dataset. In fact, graphs are powerful tools for representing the relations between data [27]. Therefore, we include an additional class of complexity measures in the experiments:

Measures of structural representation: Measures extracted from a structural representation of the dataset using graphs, which are built taking into account the relationship among the examples.

The recent work of Smith et al. [9] also proposes a new set of measures, which are intended to understand why some instances are hard to classify. Since this type of analysis is not within the scope of this paper, these measures were not included in our experiments.

3.1. Measures of overlapping in feature values

Fisher's discriminant ratio (F1): Selects the feature that best discriminates the classes. It can be calculated by Eq. (1) for binary classification problems and by Eq. (2) for problems with more than two classes ($C$ classes). In these equations, $m$ is the number of input features, $f_i$ is the $i$-th predictive feature and $p_{c_j}$ is the proportion of examples in class $c_j$:

$$F1 = \max_{i=1}^{m} \frac{\big(\mu_{c_1}^{f_i} - \mu_{c_2}^{f_i}\big)^2}{\big(\sigma_{c_1}^{f_i}\big)^2 + \big(\sigma_{c_2}^{f_i}\big)^2} \qquad (1)$$

$$F1 = \max_{i=1}^{m} \frac{\sum_{c_j=1}^{C}\sum_{c_k=c_j+1}^{C} p_{c_j}\, p_{c_k}\, \big(\mu_{c_j}^{f_i} - \mu_{c_k}^{f_i}\big)^2}{\sum_{c_j=1}^{C} p_{c_j}\, \big(\sigma_{c_j}^{f_i}\big)^2} \qquad (2)$$

For continuous features, $\mu_{c_j}^{f_i}$ and $\sigma_{c_j}^{f_i}$ are, respectively, the average and the standard deviation of the feature $f_i$ within the class $c_j$. Nominal features are first mapped into numerical values: $\mu_{c_j}^{f_i}$ is their median value, while $\sigma_{c_j}^{f_i}$ follows a binomial distribution, as presented in Eq. (3), where $p_{c_j}^{f_i}$ is the median frequency and $n_{c_j}$ is the number of examples in the class $c_j$:

$$\sigma_{c_j}^{f_i} = \sqrt{p_{c_j}^{f_i}\big(1 - p_{c_j}^{f_i}\big)\, n_{c_j}} \qquad (3)$$

High values of F1 indicate that at least one of the features in the dataset is able to linearly separate data from different classes. Low values, on the other hand, do not indicate that the problem is non-linear, but rather that there is no hyperplane orthogonal to one of the data axes that separates the classes.

Overlapping of the per-class bounding boxes (F2): This measure calculates the volume of the overlapping region of the feature values for a pair of classes. This overlapping considers the minimum and maximum values of each feature per class in the dataset. A product of the values calculated for each feature is generated. Eq. (4) illustrates F2, where $f_i$ is the feature $i$ and $c_1$ and $c_2$ are the two classes:

$$F2 = \prod_{i=1}^{m} \frac{\min(\max(f_i, c_1), \max(f_i, c_2)) - \max(\min(f_i, c_1), \min(f_i, c_2))}{\max(\max(f_i, c_1), \max(f_i, c_2)) - \min(\min(f_i, c_1), \min(f_i, c_2))} \qquad (4)$$

In multiclass problems, the final result is the sum of the values calculated for the underlying binary subproblems. A low F2 value indicates that the features can discriminate the examples of distinct classes and have low overlapping.

Maximum individual feature efficiency (F3): Evaluates the individual efficacy of each feature by considering how much each feature contributes to the separation of the classes. This measure uses examples that are not in overlapping ranges and outputs an efficiency ratio of linear separability. Eq. (5) shows how F3 is calculated, where $n$ is the number of examples in the training set and overlap is a function that returns the number of overlapping examples between two classes. High values of F3 indicate the presence of features whose values do not overlap between classes:

$$F3 = \max_{i=1}^{m} \frac{n - \mathrm{overlap}\big(x_{c_1}^{f_i}, x_{c_2}^{f_i}\big)}{n} \qquad (5)$$
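For illustration, the sketch below is our own minimal Python/numpy rendering of these three overlapping measures for a binary dataset (labels assumed in {0, 1}); it is not the DCoL implementation used in the experiments, and negative bounding-box overlaps are simply clipped to zero:

```python
# Illustrative sketch of F1, F2 and F3 (Eqs. 1, 4 and 5) for binary data.
import numpy as np

def f1(X, y):
    """Maximum Fisher's discriminant ratio over all features (Eq. 1)."""
    c1, c2 = X[y == 0], X[y == 1]
    num = (c1.mean(axis=0) - c2.mean(axis=0)) ** 2
    den = c1.var(axis=0) + c2.var(axis=0)
    return np.max(num / den)

def f2(X, y):
    """Volume of the per-class bounding-box overlap (Eq. 4)."""
    c1, c2 = X[y == 0], X[y == 1]
    lo = np.maximum(c1.min(axis=0), c2.min(axis=0))   # overlap lower bound
    hi = np.minimum(c1.max(axis=0), c2.max(axis=0))   # overlap upper bound
    rng = np.maximum(c1.max(axis=0), c2.max(axis=0)) - \
          np.minimum(c1.min(axis=0), c2.min(axis=0))
    return np.prod(np.clip(hi - lo, 0, None) / rng)

def f3(X, y):
    """Maximum individual feature efficiency (Eq. 5)."""
    c1, c2 = X[y == 0], X[y == 1]
    lo = np.maximum(c1.min(axis=0), c2.min(axis=0))
    hi = np.minimum(c1.max(axis=0), c2.max(axis=0))
    # an example "overlaps" on feature i if it falls inside [lo_i, hi_i]
    overlap = ((X >= lo) & (X <= hi)).sum(axis=0)
    return np.max((len(y) - overlap) / len(y))
```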

3.2. Measures of class separability

Distance of erroneous instances to a linear classifier (L1): This measure quantifies the linearity of the data, since the classification of linearly separable data is considered a simpler classification task. L1 computes the sum of the distances of erroneous data points to a hyperplane separating two classes. Support Vector Machines (SVM) with a linear kernel [17] are used to induce the hyperplane. This measure is used only for binary classification problems. A value equal to 0 indicates a linearly separable problem.

Training error of a linear classifier (L2): Measures the predictive performance of a linear classifier on the training data. It also uses an SVM with linear kernel. A lower training error indicates a more linear problem.

Fraction of points lying on the class boundary (N1): Estimates the complexity of the correct hypothesis underlying the data. Initially, a Minimum Spanning Tree (MST) is generated from the data, connecting the data points by their distance. The fraction of points from different classes that are connected in the MST is returned. High values of N1 indicate the need for more complex boundaries to separate the data.

Average intra/inter class nearest neighbor distances (N2): The mean intra-class and inter-class distances use the k-nearest neighbor (k-NN) algorithm to analyse the spread of the examples from distinct classes. The intra-class distance considers the distance from each example to its nearest example in the same class, while the inter-class distance computes the distance of this example to its nearest example in another class. Eq. (6) illustrates N2:

$$N2 = \frac{\sum_{i=1}^{n} \mathrm{intra}(\vec{x}_i)}{\sum_{i=1}^{n} \mathrm{inter}(\vec{x}_i)} \qquad (6)$$

Low N2 values indicate that examples of the same class lie close to each other, while far from the examples of the other classes.

Leave-one-out error rate of the 1-nearest neighbor algorithm (N3): Evaluates how distinguishable the examples of different classes are by considering the error rate of the 1-nearest neighbor (1-NN) classifier, estimated with the leave-one-out strategy. Low values indicate a high separation between the classes.
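A possible rendering of the three nearest-neighbor based measures in Python, assuming a numeric dataset X with labels y (our own sketch with scipy and scikit-learn, not the library used by the authors):

```python
# Illustrative sketch of N1 (Eq. not shown), N2 (Eq. 6) and N3.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def n1(X, y):
    """Fraction of points connected to another class in the MST."""
    d = squareform(pdist(X))
    mst = minimum_spanning_tree(d).toarray()
    boundary = set()
    for i, j in zip(*np.nonzero(mst)):
        if y[i] != y[j]:
            boundary.update((i, j))
    return len(boundary) / len(y)

def n2(X, y):
    """Ratio of summed intra-class to inter-class 1-NN distances (Eq. 6)."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)          # exclude the point itself
    intra = [d[i, y == y[i]].min() for i in range(len(y))]
    inter = [d[i, y != y[i]].min() for i in range(len(y))]
    return np.sum(intra) / np.sum(inter)

def n3(X, y):
    """Leave-one-out error rate of the 1-NN classifier."""
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y,
                          cv=LeaveOneOut()).mean()
    return 1.0 - acc
```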

3.3. Measures of geometry and topology
3.3. Measures of geometry and topology

Nonlinearity of a linear classifier (L3): Creates a new dataset by the interpolation of training data. New examples are created by linear interpolation, with random coefficients, of points chosen from a same class. Next, an SVM classifier with linear kernel is induced and its error rate on the original data is recorded. It is sensitive to the spread and overlapping of the data points and is used for binary classification problems only. Low values indicate a high linearity.

Nonlinearity of the one-nearest neighbor classifier (N4): Has the same reasoning as L3, but using the 1-NN classifier instead of the SVM.

Fraction of maximum covering spheres on data (T1): Builds hyperspheres centered on the data points. The radius of each hypersphere is increased until it touches an example of a different class. Smaller hyperspheres contained in larger ones are eliminated. T1 outputs the ratio of the number of hyperspheres formed to the total number of data points. Low values indicate a low number of hyperspheres, due to a low complexity of the data representation.

There are other measures presented in [11,26] that were not employed in this work because, by definition, they do not vary when the label noise level is increased. One of them is the dimensionality of the dataset and another is the ratio of the number of features to the number of data points (data sparsity). Similarly, measures like the directional-vector Fisher's discriminant ratio (F1v) and the collective feature efficiency (F4) from [26] were disregarded, since they have a concept similar to other measures already employed.
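As an illustration of the interpolation idea behind L3, the sketch below is our own reading of the procedure (the number of interpolated points and the use of scikit-learn's SVC are assumptions, not details given in the paper):

```python
# Rough sketch of the L3 nonlinearity measure for a binary dataset.
import numpy as np
from sklearn.svm import SVC

def l3(X, y, n_new=None, seed=None):
    """Error, on the original data, of a linear SVM trained on
    randomly interpolated same-class points."""
    rng = np.random.default_rng(seed)
    n_new = n_new or len(y)
    Xi, yi = [], []
    for _ in range(n_new):
        c = rng.choice(np.unique(y))                     # pick a class
        a_idx, b_idx = rng.choice(np.flatnonzero(y == c), size=2)
        a = rng.random()                                 # random coefficient
        Xi.append(a * X[a_idx] + (1 - a) * X[b_idx])     # interpolated point
        yi.append(c)
    clf = SVC(kernel="linear").fit(np.array(Xi), np.array(yi))
    return 1.0 - clf.score(X, y)                         # error on original data
```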
3.4. Measures of structural representation
Before using these measures, it is necessary to transform the classification dataset into a graph. This graph must preserve the similarities and distances between examples, so that the data relationships are captured. Each data point corresponds to a node or vertex of the graph. Edges are added connecting all pairs of nodes or some of the pairs.

Several techniques can be used to build a graph for a dataset. The most common are the k-nearest neighbor (k-NN) and the ε-NN [28]. While k-NN connects a pair of vertices i and j whenever i is one of the k nearest neighbors of j, ε-NN connects a pair of nodes i and j only if d(i, j) < ε, where d is a distance function. We employed the ε-NN variant, since many edge- and degree-based measures would be fixed for k-NN, regardless of the level of noise inserted in a dataset. Afterwards, all edges between examples from different classes are pruned from the graph [28]. This is a postprocessing step that can be employed for labeled datasets, which takes the class information into account.
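A minimal sketch of this construction follows. Here we assume plain Euclidean distances and interpret the 15% threshold used later (Section 4.3) as the 15th percentile of the pairwise distances; both choices are our assumptions, since the paper does not spell out these details:

```python
# Sketch of the epsilon-NN graph with intra-class edge pruning.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def build_graph(X, y, eps_quantile=0.15):
    """Return (n, edges): vertices are examples; edges connect pairs
    closer than epsilon AND belonging to the same class."""
    dists = pdist(X)
    d = squareform(dists)
    eps = np.quantile(dists, eps_quantile)   # assumed reading of the 15% threshold
    n = len(y)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # connect close pairs; edges between different classes are pruned
            if d[i, j] < eps and y[i] == y[j]:
                edges.append((i, j))
    return n, edges
```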
There are various measures able to characterize the topological and structural properties of a graph. Some of them come from the statistical characterization of complex networks [14]. We used some of these graph-based measures in this paper, which are referred to by their original nomenclature, as follows:
Number of edges (Edges): Total number of edges in the graph. High values for edge-related measures indicate that many of the vertices are connected and, therefore, that there are many high-density regions from a same class. This holds because of the postprocessing applied in this work, which removes edges connecting examples from different classes. Thus, the dataset is regarded as having low complexity if it shows a high number of edges.

Average degree of the network (Degree): The degree of a vertex i is the number of edges connected to i. The average degree of a network is the average degree of all vertices in the graph. For undirected networks, it can be computed by Eq. (7), where $v_{ij}$ is equal to 1 if i and j are connected, and 0 otherwise:

$$\mathrm{degree} = \frac{1}{n}\sum_{i,j} v_{ij} \qquad (7)$$

The same reasoning of edge-related measures applies to degree-based measures, since the degree of a vertex corresponds to the number of edges incident to it. Therefore, high degree values indicate the presence of many high-density regions from a same class, and the dataset can be regarded as having low complexity.

Average density of the network (Density): The density of a graph is the fraction of the number of edges it contains over the number of possible edges that could be formed. The average density also captures whether there are dense regions from the same class in the dataset. High values indicate the presence of such regions and a simpler dataset.

Maximum number of components (MaxComp): Corresponds to the size of the maximal connected component of the graph. In an undirected graph, a component is a subgraph with paths between all of its nodes. When a dataset shows a high overlapping between classes, the graph will probably present a large number of disconnected components, since connections between different classes are pruned from the graph. The maximal component will tend to be smaller in this case. Thus, we assume that smaller values of the MaxComp measure represent more complex datasets.

Closeness centrality (Closeness): Average number of steps required to reach every other vertex from a given vertex, where a step traverses one edge of the shortest path between them. It can be computed as the inverse of the sum of the distances to the other nodes, as shown in Eq. (8):

$$\mathrm{closeness}(v_j) = \frac{1}{\sum_{i \neq j} d(v_i, v_j)} \qquad (8)$$

Since the closeness measure uses the inverse of the shortest distances between vertices, larger values are expected for simpler datasets, which show low distances between examples from the same class.

Betweenness centrality (Betweenness): The vertex and edge betweenness are defined by the average number of shortest paths that traverse them. We employed the vertex variant. Eq. (9) gives the betweenness value of a vertex $v_j$, where $d(v_{il})$ is the total number of shortest paths from node i to node l and $d_j(v_{il})$ is the number of those paths that pass through j:

$$\mathrm{betweenness}(v_j) = \sum_{i \neq j \neq l} \frac{d_j(v_{il})}{d(v_{il})} \qquad (9)$$

The Betweenness index will be small for simpler datasets, since their dense same-class components provide many alternative shortest paths, so that few of them pass through any given vertex j.
Clustering Coefficient (ClsCoef): Measures the probability that adjacent vertices of a graph are connected. The clustering coefficient of a vertex $v_i$ is given by the ratio of the number of edges between its neighbors to the maximum number of edges that could possibly exist between these neighbors. ClsCoef will be larger for simpler datasets, which produce large connected components joining vertices from the same class.

Hub score (Hubs): Scores each node by the number of connections it has to other nodes, weighted by the number of connections these neighbors have. That is, highly connected vertices that are also connected to highly connected vertices have a higher hub score. The hub score is expected to have a low mean for high-complexity datasets, since strong vertices become less connected to strong neighbors. Hubs are expected in regions of high density from a given class; therefore, simpler datasets with high density will show larger values for this measure.

Average Path Length (AvgPath): Average size of all the shortest paths in the graph. It measures the efficiency of information spread in the network. It is given by Eq. (10), where n represents the number of vertices of the graph and $d(v_{ij})$ is the shortest distance between vertices i and j:

$$\mathrm{AvgPath} = \frac{1}{n(n-1)}\sum_{i \neq j} d(v_{ij}) \qquad (10)$$

For the AvgPath measure, high values are expected for low-density graphs, indicating an increase in complexity.

For those measures that are calculated for each vertex individually, we computed an average over all vertices in the graph. The graph measures used in this paper mainly evaluate the overlapping of the classes and their density.

A previous paper also investigated the use of complex-network measures to characterize supervised datasets [12]. It used part of the measures presented here to design meta-learning models able to predict the best performing model between a pair of classifiers for a given dataset. They also compared these measures to those from [11], but in a scenario distinct from the one adopted here. It is not clear whether they employ a postprocessing of the graph for removing edges between nodes of different classes, as done in this paper. Also, some of the measures employed in that paper are not suitable for our scenario and are not used here. One example is the number of nodes of the graph, which does not vary for a given dataset regardless of its noise level. The only measures in common with those used in [12] are the number of edges, the average clustering coefficient and the average degree. Morais and Prati [12] also do not analyse the expected values of the measures for simpler or more complex datasets. Besides introducing new measures, we also describe the behavior of all measures for simpler or more complex problems. Moreover, we try to identify the measures best suited for detecting the presence of label noise in a dataset.
3.5. Summary of measures

Table 1 summarizes the measures employed to characterize the complexity of the datasets used in this study. For each measure, we present the upper (Maximum value) and lower (Minimum value) bounds achievable and how they are associated with the increase or decrease of complexity of the classification problems (Complexity column). For a given measure, the value in the Complexity column is "+" if higher values of the measure are observed for high-complexity datasets, that is, when the measure value correlates positively with the complexity level. The "−" sign denotes the opposite, so that low values of the measure are obtained for high-complexity datasets, denoting a negative correlation.

Most of the bounds were obtained considering the equations directly, while some of the graph-based bounds were experimentally defined. For instance, for the F1 measure, if the means of the feature values are always equal, meaning that the classes overlap for all features (an extreme case), the numerator of Eq. (2) will be 0. Similarly, a maximum value cannot be determined for F1, as it depends on the feature values of each dataset. We denote this by the ∞ value in the table. In the case of the graph-based measures, we generated graphs representing simple and complex relations between the same number of data points and observed the measure values achieved. A simple graph corresponds to a case where the classes are well separated and there is a high number of connections between examples from the same class, while a complex dataset corresponds to a graph where examples of different classes are always next to each other and ultimately the connections between them are pruned according to our graph construction method.
Table 1
Summary of measures.

Type of measure                  Measure       Minimum value   Maximum value    Complexity

Overlapping in feature values    F1            0               ∞                −
                                 F2            0               1                +
                                 F3            0               1                −

Class separability               L1            0               ∞                +
                                 L2            0               1                +
                                 N1            0               1                +
                                 N2            0               ∞                +
                                 N3            0               1                +

Geometry and topology            L3            0               1                +
                                 N4            0               1                +
                                 T1            0               1                +

Structural representation        Edges         0               n(n−1)/2         −
                                 Degree        0               n−1              −
                                 MaxComp       1               n                −
                                 Closeness     0               1/(n−1)          −
                                 Betweenness   0               (n−1)(n−2)/2     +
                                 Hubs          0               1                −
                                 Density       0               1                −
                                 ClsCoef       0               1                −
                                 AvgPath       1/(n(n−1))      0.5              +


4. Experiments
This section presents the experiments performed to evaluate how the different data complexity measures from Section 3 behave in the presence of label noise for several public benchmark datasets. First, a set of classification benchmark datasets was chosen for the experiments. Different levels of label noise were then added to each dataset. We monitor how the complexity level of the datasets is affected by noise imputation. This is accomplished by:

1. Verifying the Spearman correlation between the measure values and the noise rates artificially imputed. This analysis allows identifying a set of measures more sensitive to the presence of noise in a dataset.
2. Evaluating the correlation between the measure values to identify those measures that (1) capture different concepts regarding noisy environments and (2) can be jointly used to support the development of new noise-handling techniques.

Next, we detail the experimental protocol previously outlined.

4.1. Datasets

For the experiments, we selected artificial and real datasets. The artificial datasets were introduced and generously provided by Amancio et al. [29]. These authors generated artificial classification datasets based on multivariate Gaussians, with different levels of overlapping between the classes. For this paper we selected 180 balanced datasets (with the same number of examples per class) with 2 classes, containing 2, 10 and 50 features and with different overlapping rates for each number of features. We chose such datasets based on the observations of a recent work [9], which points out that class overlap seems to be a principal contributor to instance hardness and that noisy data can ultimately be considered hard instances.

For the real datasets, we selected 53 benchmarks from the UCI and Keel repositories [30,31]. As most of them are real datasets, it is not possible to assert that they are noiseless, although some of them are artificial and show no label inconsistency. Nonetheless, a recent study showed that most of the datasets from UCI can be regarded as easy problems, since many classification techniques are able to attain high accuracies when applied to them [32]. Table 2 summarizes the main characteristics of these datasets: number of examples (Examples), number of features (Features), number of classes (Class) and percentage of the examples in the majority class (%MC).
4.2. Noise imputation

In order to corrupt the datasets, we used the most common type of artificial noise imputation method for classification problems: uniform random addition [15]. In this approach, each example has the same probability of having its label exchanged for another random label [33]. For each dataset, noise was inserted at different levels, namely 5%, 10%, 20% and 40%. Thus, we were able to investigate the influence of increasing noise levels. Since the selection of examples is random, we generated 10 different noisy versions of each dataset for each noise level. Because most of the original benchmark datasets may already contain some noise, the resulting levels of 5%, 10%, 20% and 40% are potential rather than exact noise levels.
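A minimal sketch of this noise imputation scheme, assuming numpy arrays (our own illustration of the uniform random model described above):

```python
# Sketch of uniform random label noise imputation.
import numpy as np

def add_label_noise(y, rate, seed=None):
    """Flip the labels of a fraction `rate` of the examples to another,
    randomly chosen, class label."""
    rng = np.random.default_rng(seed)
    y = np.array(y)
    noisy_idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    labels = np.unique(y)
    for i in noisy_idx:
        y[i] = rng.choice(labels[labels != y[i]])  # any label but the current one
    return y, noisy_idx

# Ten noisy versions per level, as in the experimental protocol:
# for level in (0.05, 0.10, 0.20, 0.40):
#     versions = [add_label_noise(y_train, level, seed) for seed in range(10)]
```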
4.3. Methodology

First, we created noisy versions of the original datasets from Section 4.1 by using the previously described systematic model of noise imputation. Although all datasets were partitioned by 10-fold cross-validation, noise was inserted in the training folds only. The complexity measures were extracted from the original training datasets and from their noisy versions.

To calculate the complexity measures described in Sections 3.1 to 3.3, we used the Data Complexity Library (DCoL) [26].

Table 2
Summary of dataset characteristics: name, number of examples, number of features, number of classes and the percentage of the majority class of each dataset.

Dataset                        Examples   Features   Class   %MC
abalone                        4153       9          19      16
acute-inflammations            120        8          2       58
appendicitis                   106        8          2       80
australian                     690        15         2       55
backache                       180        32         2       86
balance                        625        5          3       46
banana                         5300       3          2       55
blood-transfusion-service      748        5          2       76
breast-tissue                  106        10         6       20
bupa                           345        7          2       57
car                            1728       7          4       70
cardiotocography               2126       21         10      27
cmc                            1473       10         3       42
collins                        485        22         13      16
connectionist-mines-vs-rocks   208        61         2       53
crabs                          200        6          2       50
expgen                         207        80         5       57
flags                          178        29         5       33
flare                          1066       12         6       31
glass                          205        10         5       37
habermans-survival             306        4          2       73
hayes-roth                     160        5          3       40
ionosphere                     351        34         2       64
iris                           150        5          3       33
kr-vs-kp                       3196       37         2       52
led7digit                      500        8          10      11
leukemia-haslinger             100        51         2       51
molecular-promoters            106        58         2       50
monk1                          556        7          2       50
monk2                          601        7          2       65
movement-libras                360        91         15      06
newthyroid                     215        6          3       69
page-blocks                    5473       11         5       89
parkinsons                     195        23         2       75
phoneme                        5404       6          2       70
pima                           768        9          2       65
ringnorm                       7400       21         2       50
saheart                        462        10         2       65
segmentation                   2310       19         7       14
spectf                         349        45         2       72
statlog-german                 1000       21         2       70
statlog-heart                  270        14         2       55
tae                            151        6          3       34
tic-tac-toe                    958        10         2       65
titanic                        2201       4          2       67
vehicle                        846        19         4       25
vowel                          990        11         11      09
waveform-5000                  5000       41         3       33
wdbc                           569        31         2       62
wine                           178        14         3       39
wine-quality                   6497       12         6       43
yeast                          1484       9          9       31
zoo                            84         17         4       48


All distance-based measures employed the normalized Euclidean distance for continuous features and the overlap distance for nominal features (this distance is 0 for equal categorical values and 1 otherwise) [34]. To build the graph for the graph-based measures, we used the ε-NN algorithm, with the ε threshold value equal to 15%, as in [12]. The measures described in Section 3.4 were calculated using the Igraph library [35].

As a result, we have one meta-level dataset that is employed in the subsequent experiments. It contains 20 meta-features (the 11 complexity and 9 graph-based measures) describing the benchmark datasets and their noisy versions. This meta-level dataset therefore has 2173 examples: 53 (original datasets) + 53 (datasets) × 4 (noise levels) × 10 (random versions). Two types of analysis were performed using the meta-level dataset:

1. Correlation between the measure values and the noise level of the datasets;
2. Correlation within the measure values.

The first analysis verifies whether there is a direct relation between the noise level of a dataset and the values of the measures extracted from it. This allows the identification of the measures that are more sensitive to the presence of noise. For such, the Spearman's rank correlation between the measure values and the different noise levels was calculated for all datasets. Those measures that present a significant correlation according to the Spearman's statistical test (at 95% confidence) were selected for further analysis.

The second analysis evaluates the Spearman correlation between the measures with the highest sensitivity to the presence of noise according to the previous experimental results. It looks for overlapping in the complexity concepts extracted by these measures. Similar analyses are carried out in [9] for assessing the relationship between some of the instance hardness measures proposed there. While a high correlation could indicate that two measures capture the same complexity concepts, a low correlation indicates that the measures could complement each other, an issue that can be further explored.
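The first analysis can be summarized by a short sketch (our illustration, using scipy; `meta` is a hypothetical mapping from measure name to the list of measure values aligned with the imputed noise levels):

```python
# Sketch: Spearman's rank correlation between each measure and the noise level.
from scipy.stats import spearmanr

def noise_sensitivity(meta, noise_levels, alpha=0.05):
    """Return the measures significantly correlated with the noise level."""
    selected = {}
    for name, values in meta.items():
        rho, p = spearmanr(values, noise_levels)
        if p < alpha:                  # significant at 95% confidence
            selected[name] = rho       # sign tells direct/indirect relation
    return selected
```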

5. Results of correlation analysis

This section presents the experimental results for the correlation analysis previously described. We also evaluated the results for some artificial datasets, as described in Section 4.1. These results were quite similar to those observed for the benchmark datasets, with the difference that the absolute correlation values were higher for the artificial datasets. The complete set of results, including those obtained for the artificial datasets and the datasets employed, can be consulted at http://lpfgarcia.github.io/rcorr/.

The first analysis considers all measures. Its results then refine a subset of measures more sensitive to noise imputation, which is further analysed in the second correlation study.

[Fig. 1 about here: one histogram panel per measure (F1, F2, F3, L1, L2, L3, N1, N2, N3, N4, T1, Edges, Degree, Density, MaxComp, Closeness, Betweenness, Hub, ClsCoef, AvgPath), with the normalized measure values on the horizontal axis and bars colored by noise rate: 0%, 5%, 10%, 20% and 40%.]

Fig. 1. Histogram of each measure for distinct noise levels.

Fig. 1 presents histograms of the values of the complexity measures for all benchmark datasets when random noise is added. The bars are colored according to the amount of noise inserted, from 0% (original datasets) to 40%. The measure values were normalized considering all datasets to allow their direct comparison. It is possible to notice that some of the measures are more sensitive to noise imputation and present clear limits on their values for different noise levels. They are: N1, N3, Edges, Degree and Density. Other measures, like Betweenness, do not present a clear contrast in their values for different noise levels.

Furthermore, it is also possible to notice from Fig. 1 that, as more noise is added to the datasets, the complexity of the classification problem tends to increase. This is reflected in the values of the majority of the complexity measures, which either increased or decreased when noise is added, in accordance with their positive or negative correlation to the complexity level, as shown in Table 1 (column Complexity). For instance, higher N1 values are expected for more complex datasets, and the N1 values indeed increased for higher levels of noise. On the other hand, lower F1 values are expected for more complex datasets, and we can observe that, as more noise is added to the datasets, the F1 values tend to reduce.

5.1. Correlation of measures to the noise level

Fig. 2 shows the correlation between the values of the measures and the different noise levels in the datasets. Positive and negative values are plotted in order to clearly show which measures are directly or indirectly correlated with the noise levels. It is noticeable that, as the noise level increases, the values of the complexity measures either increase or decrease accordingly, indicating increases in the complexity level of the noisy datasets. The closer to 1 or −1, the stronger the relation between the measure and the noise level.

According to the statistical test employed, 18 measures presented significant correlation to the noise levels, at 95% confidence. Among the measures with direct correlation to the noise level, nine are basic complexity measures from the literature (N3, N1, N2, N4, L2, L1, T1, F2 and L3). These measures mainly capture: class separability (N3, N1, N2, L2 and L1), data topology according to a nearest neighbor classifier (N4, T1 and L3) and individual feature overlapping (F2). Regarding the measures indirectly related to the noise levels, two are basic complexity measures based on feature overlapping (F1 and F3), while six are based on structural representation (Density, Hub, Degree, ClsCoef, Edges and MaxComp). Only the Closeness and Betweenness measures did not present significant correlation to the noise levels. As expected, the most prominent measures are the same that showed more distinct values for different noise levels in the histograms of Fig. 1.

Despite the statistical significance, it is possible to notice some low correlation values in Fig. 2. Only the measures N3, N1 and N2 presented correlation values higher than 0.5. These correlations were higher in the experiments with artificial datasets. This can be a consequence of the fact that, for real datasets, the amount of noise added is potential rather than actual.

[Fig. 2 about here: bar plot of the Spearman correlation of each measure (N3, N1, N2, N4, L2, L1, T1, F2, L3, AvgPath, Betweenness, Closeness, MaxComp, Edges, ClsCoef, Degree, Hub, F3, Density, F1) to the noise levels, ranging from −1.0 to 1.0.]

Fig. 2. Correlation of each measure to the noise levels.

5.2. Correlation between measures

In order to verify whether the measures capture similar or distinct information from the data, we calculated pairwise correlations between their values. Only the measures considered most relevant in the previous analysis were included. These measures were highlighted as more sensitive to noise imputation and can therefore be successfully employed for noise identification.

Fig. 3 shows a heatmap of the correlations between these pairs of measures. Each column and row corresponds to a measure. Each cell is colored according to the correlation value, from gray (highest correlation, whether positive or negative) to white (lowest correlation). The absolute values of all correlations are also shown inside the heatmap cells. We highlight in bold the correlation values that are not significant according to the Spearman's correlation test (at 95% confidence level). These pairs of measures correspond to those that can potentially complement each other.

According to the heatmap, various measures are weakly correlated with each other; therefore, they capture distinct aspects of the data. As expected, the measures N1, N2, N3 and N4 from [11] are highly correlated: they are all based on nearest neighbor information. Despite the fact that all structural representation measures are extracted from a nearest neighbor graph, their correlation to N1, N2, N3 and N4 is low in several cases. Among the graph-based measures, high correlations are observed between Edges, Degree and MaxComp.
[Fig. 3 about here: heatmap of the pairwise Spearman correlations between the selected measures (F1, F2, F3, L1, L2, L3, N1, N2, N3, N4, T1, Edges, Degree, Density, MaxComp, ClsCoef, Hub, AvgPath), with absolute correlation values printed in each cell.]

Fig. 3. Heatmap of correlation between measures.

Since the degree of a graph is calculated considering the number of its edges and the number of connected components, this correlation is expected by definition. It is interesting to notice that many of the measures highlighted as distinguishing the noise levels have low correlation between them. This is particularly true for the class separability measures (e.g., N3) when paired with the structural representation measures (e.g., Degree and Edges). Therefore, they could be combined to improve noise identification and handling. This issue is preliminarily investigated in the next section.

6. Proposed noise filtering technique

Based on the previous experimental results, this section proposes and investigates a new filter, named GraphNN, for label noise identification. It uses two of the measures most correlated with the increasing noise levels in the datasets: the leave-one-out error rate of the 1-nearest neighbor classifier (N3) and the average degree of the network (Degree).

The GraphNN filter identifies noisy examples by first constructing a graph from the dataset, as described in Section 3.4. Afterwards, it uses the degree of each vertex to point out an example as a potential noise. In fact, when an example is mislabeled, it will probably be close to examples from another class. In this case, its edges to close examples will be pruned and the example will tend to have a low degree value. Safe examples, on the other hand, will be connected to a high number of examples from the same class and show a high degree value. Therefore, it is necessary to stipulate a threshold on the node degree below which the corresponding example is considered noisy.

When a dataset has a large amount of noise, a larger number of examples will have a low degree value and the threshold value can be higher. On the other hand, for datasets with a lower noise level, a lower threshold value is required; otherwise, many safe examples will be regarded as noisy. Given the difficulty of selecting a threshold value appropriate for all of the datasets employed in our experiments, we use the N3 measure value to estimate the percentage of noise in the dataset. N3 was the measure most correlated with the noise levels in our experiments and the one for which the clearest limits on the values obtained for distinct noise levels can be observed. The ε value adopted to build the graph from the data also influences the average degree of the nodes and can influence the threshold choice, although this was not studied in our experiments.

Therefore, in GraphNN we first order all examples according to their degree values. Afterwards, the N3 value delimits how many of the examples of lowest degree can be regarded as noisy. Furthermore, among the examples with a degree lower than the threshold, only those that are misclassified by the 1-NN classifier used in N3 are considered noisy. This polling provides more robustness for maintaining safe examples.
6.1. Experiments on noise filtering

We performed a set of experiments in order to analyse the performance of the GraphNN filter in identifying noisy examples. The same datasets used in our previous analysis were employed in these experiments.

Since noise was added in a controlled way, it is possible to assess which examples are noisy and to record the performance of the filter in retrieving them. We thereby employ the F1-measure in this evaluation, as given by Eq. (11) and proposed in [3]. This measure combines the precision and recall values of a filter in noise identification: precision is defined as the number of correctly identified noisy cases divided by the number of examples identified by the filter as noisy; recall is the number of correctly identified noisy cases divided by the total number of noisy examples introduced in the dataset:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (11)$$
As baselines, we included some distance-based noise filtering techniques from the literature [36,4]. All of them employ the k-nearest neighbor (k-NN) algorithm to determine whether an example is suspicious or not. According to them, an example is consistent only if it is located close to examples from its own class. This resembles our method, where the graph captures the similarity between examples from the same class. Otherwise, the example is either incorrectly labeled or lies on the decision border. In the latter case, it is also considered unsafe, since small perturbations in a borderline example can move it to the wrong side of the decision border.

The Edited Nearest Neighbor (ENN) technique removes an example if the labels of its k nearest neighbors are different from its actual label [37]. Repeated ENN (RENN) applies ENN repeatedly until all objects have the majority of their neighbors from the same class. In All-kNN, instead of using a fixed value of k, increasing values of k are considered [4]. At the end of each iteration, examples that have the majority of their neighbors from other classes are marked as noisy. These flagged examples are then eliminated from the dataset.
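A minimal sketch of the ENN baseline under our reading of [37] (majority-vote variant, k assumed equal to 3; not the exact implementation evaluated in the paper):

```python
# Sketch of ENN: flag an example when most of its k nearest neighbors
# carry a different label.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_noisy(X, y, k=3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is the example itself
    flagged = []
    for i in range(len(y)):
        neighbor_labels = y[idx[i, 1:]]
        if np.sum(neighbor_labels != y[i]) > k / 2:
            flagged.append(i)
    return flagged
```

RENN follows by re-applying `enn_noisy` to the cleaned data until no example is flagged, and All-kNN by accumulating the flags over increasing values of k.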

6.2. Performance of the cleansing filter

The F1-measure values obtained by the filters in the retrieval of the artificially imputed noisy examples are shown in the heatmap of Fig. 4. In this figure, each column represents one filter, while each row corresponds to a dataset. The closer the color is to black, the higher the F1-score. Gray represents an intermediate F1-score and white represents low F1-score levels. According to the heatmap, the GraphNN filter attained, on average, a good predictive performance for most of the datasets. For the banana, connectionist-mines-vs-rocks, crabs, kr-vs-kp, monk2, ringnorm, titanic and tic-tac-toe datasets, the performance decreased.

[Fig. 4 about here: heatmap of the per-dataset F1-scores of the AENN, ENN, RENN and GraphNN filters over the 53 benchmark datasets.]

Fig. 4. F-score of filters for each dataset.

The number of times each filter performed best in noise identification, over all noise rates and all datasets, is presented in Fig. 5. The best performance was obtained by GraphNN, followed by RENN and AENN. The ENN filter showed the worst performance.

[Fig. 5 about here: bar plot of the frequency of wins of each filter (AENN, ENN, GraphNN, RENN).]

Fig. 5. Frequency with which each filter had the best F1-score.

7. Conclusion
This work investigated how label noise affects the complexity
of classication tasks, by monitoring the values of simple measures extracted from datasets with increasing noise levels. Part of
these measures were already used in the literature for understanding and analyzing the complexity of classication tasks.
Some other measures that are based on modeling the datasets
by graphs were introduced in this paper.
Experimentally, measures able to capture characteristics like
separability of the classes, alterations in the class boundary and
densities within the classes were the most affected ones by the
introduction of label noise in the data. Therefore, they are good
candidates for further exploitation and to support the design of
new noise identication techniques and noise-tolerant classication algorithms. Moreover, experimental results showed a low
correlation between the basic complexity measures and the graphbased measures, stressing the relevance of exploring different
views and representations of the data structure. Using two of
the most prominent measures from these analyses, we devised a
new noise ltering technique, which has shown promising results
in noise identication.
The graph-based measures Edges, Degree and Density were highlighted in all analyses carried out. This may have occurred because, when label noise is introduced, examples from distinct classes become closer to each other and are not connected in the graph. Among the standard data complexity measures, those that rely on nearest neighbor information, such as N1 and N3, were also better able to capture the effects of noise imputation. This is also due to the fact that label noise tends to affect the spatial proximity of data from different classes. Thus, the idea that data from the same class tend to lie close to each other in the feature space, and far from examples of different classes, is reinforced.
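As an illustration of how such measures can be computed, the sketch below builds a distance-based graph in which only sufficiently close examples of the same class are connected, in the spirit of the construction described above (the exact scheme follows [12]), and extracts Edges, Degree and Density from it, together with N3 as the leave-one-out error of the 1-NN classifier. The Euclidean metric, the distance normalization and the threshold eps = 0.15 are our assumptions for illustration:

```python
# Illustrative computation of the graph-based measures (Edges, Degree,
# Density) and of the nearest-neighbor measure N3. Assumes numpy arrays
# X (features) and y (labels).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def graph_measures(X, y, eps=0.15):
    """Edges, mean Degree and Density of the same-class epsilon-NN graph."""
    D = squareform(pdist(X, metric='euclidean'))
    D = D / D.max()                                  # normalize to [0, 1]
    adj = (D < eps) & (y[:, None] == y[None, :])     # close, same-class pairs
    np.fill_diagonal(adj, False)                     # no self-loops
    n = len(y)
    n_edges = int(adj.sum()) // 2                    # each edge counted twice
    mean_degree = adj.sum(axis=1).mean()
    density = 2.0 * n_edges / (n * (n - 1))
    return n_edges, mean_degree, density

def n3(X, y):
    """Measure N3: leave-one-out error rate of the 1-NN classifier."""
    D = squareform(pdist(X, metric='euclidean'))
    np.fill_diagonal(D, np.inf)                      # exclude the point itself
    nearest = D.argmin(axis=1)                       # index of each 1-NN
    return float(np.mean(y[nearest] != y))
```

Under this construction, flipping labels breaks same-class edges and increases 1-NN disagreement, which is consistent with Edges, Degree and Density dropping, and N3 rising, as the noise level grows.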
Our experimental protocol and graph-based measures can also be used in other types of analysis, such as verifying the effects of class imbalance, feature selection and feature discretization, among others. It is also possible to use other combinations of measures to devise new preprocessing filters. We also plan to employ feature selection strategies to single out the measures best able to characterize noisy data. It would also be interesting to investigate how the graph-based measures are affected by the choice of the parameter used to build the graph. Finally, we plan to use some of the highlighted measures to develop new noise-tolerant algorithms and to compare GraphNN with other up-to-date noise filters.

We created a webpage with additional information regarding the experimental results, the code used in the experiments and the results for some artificial datasets. It can be found at http://lpfgarcia.github.io/rcorr/.

Acknowledgments

The authors would like to thank FAPESP, CNPq and CAPES for their financial support.

References

[1] J.R. Quinlan, The effect of noise on concept learning, in: R.S. Michalski, J.G. Carbonell, T.M. Mitchell (Eds.), Machine Learning, Morgan Kaufmann Publishers Inc., 1986, pp. 149–166.
[2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Knowledge discovery and data mining: towards a unifying framework, in: E. Simoudis, J. Han, U.M. Fayyad (Eds.), KDD, AAAI Press, 1996, pp. 82–88.
[3] B. Sluban, D. Gamberger, N. Lavrač, Ensemble-based noise detection: noise ranking and visual performance evaluation, Data Mining Knowl. Discov. 28 (2) (2014) 265–303. http://dx.doi.org/10.1007/s10618-012-0299-1.
[4] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Trans. Syst. Man Cybern. 6 (6) (1976) 448–452. http://dx.doi.org/10.1109/TSMC.1976.4309523.
[5] C.E. Brodley, M.A. Friedl, Identifying and eliminating mislabeled training instances, in: W.J. Clancey, D.S. Weld (Eds.), AAAI/IAAI, vol. 1, AAAI Press/The MIT Press, 1996, pp. 799–805.
[6] S. Verbaeten, A.V. Assche, Ensemble methods for noise elimination in classification problems, in: T. Windeatt, F. Roli (Eds.), Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 2709, Springer, Berlin, Heidelberg, 2003, pp. 317–325. http://dx.doi.org/10.1007/3-540-44938-8_32.
[7] B. Sluban, D. Gamberger, N. Lavrač, Advances in class noise detection, in: H. Coelho, R. Studer, M. Wooldridge (Eds.), ECAI, Frontiers in Artificial Intelligence and Applications, vol. 215, IOS Press, 2010, pp. 1105–1106.
[8] L.P.F. Garcia, A.C. Lorena, A.C.P.L.F. Carvalho, A study on class noise detection and elimination, in: A.C. Lorena, C.E. Thomaz, A.T.R. Pozo (Eds.), 2012 Brazilian Symposium on Neural Networks (SBRN), IEEE, 2012, pp. 13–18. http://dx.doi.org/10.1109/SBRN.2012.49.
[9] M. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Mach. Learn. 95 (2) (2014) 225–256. http://dx.doi.org/10.1007/s10994-013-5422-z.
[10] B. Frénay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Trans. Neural Netw. Learn. Syst. 25 (5) (2014) 845–869. http://dx.doi.org/10.1109/TNNLS.2013.2292894.
[11] T.K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 289–300. http://dx.doi.org/10.1109/34.990132.
[12] G. Morais, R.C. Prati, Complex network measures for data set characterization, in: 2013 Brazilian Conference on Intelligent Systems (BRACIS), 2013, pp. 12–18. http://dx.doi.org/10.1109/BRACIS.2013.11.
[13] L.F. Costa, F.A. Rodrigues, G. Travieso, P.R.V. Boas, Characterization of complex networks: a survey of measurements, Adv. Phys. 56 (2007) 167–242.
[14] E. Kolaczyk, Statistical Analysis of Network Data: Methods and Models, Springer Series in Statistics, Springer, 2009.
[15] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study, Artif. Intell. Rev. 22 (3) (2004) 177–210. http://dx.doi.org/10.1007/s10462-004-0751-8.
[16] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106. http://dx.doi.org/10.1007/BF00116251.
[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., 1995.
[18] E. Eskin, Detecting errors within a corpus using anomaly detection, in: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, Association for Computational Linguistics, 2000, pp. 148–153.
[19] A. Ganapathiraju, J. Picone, Support vector machines for automatic data cleanup, in: INTERSPEECH, ISCA, 2000, pp. 210–213.
[20] L. Li, Y.S. Abu-Mostafa, Data Complexity in Machine Learning, Technical Report CaltechCSTR:2006.004, Caltech Computer Science, 2006.
[21] T.K. Ho, Data complexity analysis: linkage between context and solution in classification, in: Structural, Syntactic, and Statistical Pattern Recognition, Lecture Notes in Computer Science, vol. 5342, 2008, pp. 986–995. http://dx.doi.org/10.1007/978-3-540-89689-0_102.
[22] S. Singh, Prism: a novel framework for pattern recognition, Pattern Anal. Appl. 6 (2) (2003) 134–149. http://dx.doi.org/10.1007/s10044-002-0186-2.
[23] J.A. Sáez, J. Luengo, F. Herrera, Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognit. 46 (1) (2013) 355–364. http://dx.doi.org/10.1016/j.patcog.2012.07.009.
[24] L.P.F. Garcia, A.C. Carvalho, A.C. Lorena, Noisy data set identification, in: J.-S. Pan, M. Polycarpou, M. Woźniak, A. Carvalho, H. Quintián, E. Corchado (Eds.), Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, vol. 8073, Springer, Berlin, Heidelberg, 2013, pp. 629–638. http://dx.doi.org/10.1007/978-3-642-40846-5_63.
[25] R.A. Mollineda, J.S. Sánchez, J.M. Sotoca, Data characterization for effective prototype selection, in: J.S. Marques, N.P. de la Blanca, P. Pina (Eds.), Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 3523, Springer, Berlin, Heidelberg, 2005, pp. 27–34. http://dx.doi.org/10.1007/11492542_4.
[26] A. Orriols-Puig, N. Macià, T.K. Ho, Documentation for the Data Complexity Library in C++, Technical Report, La Salle, Universitat Ramon Llull, 2010.
[27] N. Ganguly, A. Deutsch, A. Mukherjee, Dynamics on and of Complex Networks: Applications to Biology, Computer Science, and the Social Sciences, Modeling and Simulation in Science, Engineering and Technology, Birkhäuser, Boston, 2009.
[28] X. Zhu, J. Lafferty, R. Rosenfeld, Semi-Supervised Learning with Graphs (Ph.D. Thesis), Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2005.
[29] D.R. Amancio, C.H. Comin, D. Casanova, G. Travieso, O.M. Bruno, F.A. Rodrigues, L. da F. Costa, A systematic comparison of supervised classifiers, PLoS ONE 9 (4) (2014) e94137. http://dx.doi.org/10.1371/journal.pone.0094137.
[30] K. Bache, M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.
[31] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Mult.-Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.
[32] N. Macià, E. Bernadó-Mansilla, Towards UCI+: a mindful repository design, Inf. Sci. 261 (2014) 237–262. http://dx.doi.org/10.1016/j.ins.2013.08.059.
[33] C.-M. Teng, Correcting noisy data, in: I. Bratko, S. Dzeroski (Eds.), ICML, Morgan Kaufmann, 1999, pp. 239–248.
[34] C. Giraud-Carrier, T. Martinez, An Efficient Metric for Heterogeneous Inductive Learning Applications in the Attribute-Value Language, Technical Report, University of Bristol, Bristol, UK, 1995.
[35] G. Csardi, T. Nepusz, The igraph software package for complex network research, InterJournal Complex Systems 1695 (2006). URL http://igraph.sf.net.
[36] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (3) (2000) 257–286.
[37] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybern. 2 (3) (1972) 408–421.

Luís P.F. Garcia received the B.Sc. degree in Computer Engineering in 2010 from the University of São Paulo, Brazil. He is currently a Ph.D. candidate at the Institute of Computer Sciences and Mathematics, University of São Paulo, Brazil. His current research interests include Machine Learning, Data Mining, Meta-learning and Data Preprocessing.

André C.P.L.F. de Carvalho is a Full Professor in the Department of Computer Science, University of São Paulo, Brazil, where he was head of department from 2008 to 2010. He received his B.Sc. and M.Sc. degrees in Computer Science from the Universidade Federal de Pernambuco, Brazil, and his Ph.D. degree in Electronic Engineering from the University of Kent, UK. He has published around 80 journal and 200 conference refereed papers and has been involved in the organization of several conferences and journal special issues. His main interests are Machine Learning, Data Mining and Hybrid Intelligent Systems. He is on the editorial board of several journals and was a member of the Council of the Brazilian Computing Society, SBC. He was the editor of the SBC/Elsevier textbook series until July 2011. Prof. de Carvalho is the Director of the Research Center for Machine Learning in Data Analysis of the Universidade de São Paulo, Brazil.

Ana C. Lorena holds bachelor (2001) and Ph.D. (2006) degrees in Computer Science from the University of São Paulo at São Carlos. She is currently an Assistant Professor at the Federal University of São Paulo. She has experience in Computer Science, working mainly on the following topics: data mining, supervised machine learning, hybrid intelligent systems and support vector machines.
