2015-07
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Av. Trabalhador São-carlense 400, São Carlos, São Paulo 13560-970, Brazil
Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo, Talim St. 330, São José dos Campos, São Paulo 12231-280, Brazil
Article info

Article history:
Received 11 March 2014
Received in revised form 8 October 2014
Accepted 11 October 2014
Available online 11 February 2015

Abstract
Noisy data are common in real-world problems and may have several causes, like inaccuracies, distortions or contamination during data collection, storage and/or transmission. The presence of noise in data can affect the complexity of classification problems, making the discrimination of objects from different classes more difficult and requiring more complex decision boundaries for data separation. In this paper, we investigate how noise affects the complexity of classification problems, by monitoring the sensitivity of several indices of data complexity in the presence of different label noise levels. To characterize the complexity of a classification dataset, we use geometric, statistical and structural measures extracted from the data. The experimental results show that some measures are more sensitive than others to the addition of noise to a dataset. These measures can be used in the development of new preprocessing techniques for noise identification and of novel label noise tolerant algorithms. We thereby show preliminary results on a new filter for noise identification, which is based on two of the complexity measures that were most sensitive to the presence of label noise.
© 2015 Elsevier B.V. All rights reserved.
Keywords:
Classification
Label noise
Complexity measures
Noise filter
1. Introduction
Noisy data can be regarded as objects that present inconsistencies in their predictive and/or target attribute values [1]. These inconsistencies can be errors, absent information or unknown values [2]. Whereas noise needs to be identified and treated, secure data in a dataset must be preserved [3]. The term secure data usually refers to instances that form the core of the knowledge necessary to build accurate learning models.
There are various strategies and techniques in the literature to handle noisy data [4–8,3,9]. Generally, the identified noisy data are filtered and removed from the datasets. Nonetheless, it is usually hard to determine whether a given instance is indeed noisy. Some recent proposals include designing classification techniques that are more tolerant and robust to noise, as surveyed in [10].
Regardless of the strategy employed to deal with noisy data, either data cleansing or the design of noise-tolerant learning algorithms, it is important to understand the effect of noise on the complexity of classification problems. This analysis can support the development of new techniques for dealing with noisy data more effectively.
This paper experimentally investigates the effect of distinct levels of label noise on the complexity of classification problems.
2. Label noise
In supervised learning, a learning algorithm is applied to a dataset containing n pairs (x⃗_i, y_i), where each x⃗_i is a data point described by m predictive features and y_i corresponds to the expected label of x⃗_i. In classification problems, this label corresponds to a class or a category. The learning algorithm then induces a classification model, which should be able to predict the label of new data points. The presence of noise in a dataset affects its quality and may impair the predictive performance of the induced classifiers.
When dealing with supervised classification problems, noise can be found in two places [15]: (a) the predictive features and (b) the target attribute. Noise in predictive features is introduced in one or more predictive attributes as a consequence of incorrect, absent or unknown values. Noise in the target attribute, on the other hand, occurs in the class labels. It can be caused by errors or subjectivity in data labeling, as well as by the use of inadequate information in the labeling process. Lastly, noise in predictive features can lead to a wrong labeling of the data points, since they can be moved to the wrong side of the decision frontier.
The majority of the existing Machine Learning (ML) algorithms for solving classification problems try to minimize a cost function based on misclassification, assuming the target label values in the dataset to be correct. Due to the importance of the class information in supervised learning, in this paper we deal with noise in the target attribute only.
Learning in noisy environments has been largely investigated in recent decades [10]. Some authors prefer to deal with the problem while the classification model is induced, providing more robustness to the induced models. The pruning process in Decision Tree (DT) induction algorithms is an early initiative to increase their robustness to noisy data [16]. Nonetheless, if the noise level is high, the definition of the pruning degree can be challenging, and pruning can ultimately remove branches that are based on safe information too. Another example is the use of slack variables in Support Vector Machine (SVM) training [17], which allow some examples to be misclassified or to lie within the margins of separation between the classes. This introduces an additional parameter to be tuned during the SVM training: the regularization constant. This constant accounts for the amount of training examples that can be misclassified or placed near the induced decision boundary.
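The effect of the regularization constant can be illustrated with a minimal soft-margin SVM trained by subgradient descent on the primal hinge loss (an illustrative sketch only, not the formulation of [17]; all names below are ours, and C plays the role of the regularization constant discussed above):

```python
import numpy as np

def train_svm(X, y, C=1.0, lr=0.01, epochs=200):
    # Subgradient descent on (1/2)||w||^2 + C * sum(max(0, 1 - y*(w.x + b))).
    # Small C tolerates margin violations (softer, more noise-tolerant);
    # large C penalizes every misclassified or in-margin example heavily.
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                       # margin violators
        w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w, b = train_svm(X, y, C=1.0)
acc = np.mean(np.sign(X @ w + b) == y)
```

With a small C, margin violations are cheap and a few mislabeled points barely move the decision boundary; with a large C, the optimization bends the boundary to fit them.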
Recent work addresses noise-tolerant classifiers, in which a label noise model is learned jointly with the classification model itself [18]. For such approaches, typically, some information must be available about the label noise or its effects [10]. The learning algorithm can also be modified to embed data cleansing [19]. Other authors prefer to
treat noise beforehand, in a preprocessing step. Filters are developed for this purpose, which scan the dataset for unreliable data [7,8,3]. The preprocessed dataset can then be used as input to any classification algorithm.
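As a concrete illustration of such a filter, the sketch below follows the spirit of the edited nearest-neighbor rule [4]: an example is flagged as unreliable when the majority label among its k nearest neighbors differs from its own label. This is an assumption-laden sketch (the names and the simple majority-vote criterion are ours), not the filter proposed later in this paper:

```python
import numpy as np

def enn_filter(X, y, k=3):
    # Flag example i as unreliable when the majority label among its
    # k nearest neighbors differs from y[i] (leave-one-out neighbors).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # an example is not its own neighbor
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        nb = np.argsort(D[i])[:k]
        votes = np.bincount(y[nb], minlength=y.max() + 1)
        if votes.argmax() != y[i]:
            keep[i] = False              # flagged as label noise
    return keep

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
y = np.r_[np.zeros(30, int), np.ones(30, int)]
y_noisy = y.copy()
y_noisy[:3] = 1                          # three mislabeled examples
keep = enn_filter(X, y_noisy)
```

Examples with `keep == False` would then be removed before training any classifier on the cleaned dataset.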
One of the contributions of this paper is to experimentally show how label noise affects the complexity of various classification datasets. Based on these results, we also present a new noise filtering technique. This evidences the importance of understanding the influence of label noise in classification problems. Furthermore, these results may contribute to the proposal of other filters and of novel noise-tolerant algorithms.
3. Data complexity measures

Fisher's discriminant ratio (F1): Selects the feature that best discriminates the classes. It can be calculated by Eq. (1) for binary classification problems and by Eq. (2) for problems with more than two classes (C classes). In these equations, m is the number of input features, f_i is the i-th predictive feature, μ_{c_j}^{f_i} and σ_{c_j}^{f_i} are the mean and standard deviation of f_i in class c_j, and p_{c_j} is the proportion of examples in class c_j:

F1 = max_{i=1,…,m} (μ_{c1}^{f_i} − μ_{c2}^{f_i})² / ((σ_{c1}^{f_i})² + (σ_{c2}^{f_i})²)    (1)

F1 = max_{i=1,…,m} [Σ_{c_j=1}^{C} Σ_{c_k=1, c_k≠c_j}^{C} p_{c_j} p_{c_k} (μ_{c_j}^{f_i} − μ_{c_k}^{f_i})²] / [Σ_{c_j=1}^{C} p_{c_j} (σ_{c_j}^{f_i})²]    (2)
Volume of overlapping region (F2): Measures the overlap of the feature values of examples from different classes, as the product, over all features, of the ratio between the width of the range where the two classes overlap and the full range spanned by both classes (Eq. (4)), where max(f_i, c_j) and min(f_i, c_j) are the maximum and minimum values of feature f_i in class c_j:

F2 = Π_{i=1}^{m} [min(max(f_i, c1), max(f_i, c2)) − max(min(f_i, c1), min(f_i, c2))] / [max(max(f_i, c1), max(f_i, c2)) − min(min(f_i, c1), min(f_i, c2))]    (4)

In multiclass problems, the final result is the sum of the values calculated for the underlying binary subproblems. A low F2 value indicates that the features can discriminate the examples of distinct classes and have low overlapping.
Maximum individual feature efficiency (F3): Evaluates the individual efficacy of each feature by considering how much each feature contributes to the separation of the classes. This measure uses the examples that are not in overlapping ranges and outputs an efficiency ratio of linear separability. Eq. (5) shows how F3 is calculated, where n is the number of examples in the training set and overlap is a function that returns the number of overlapping examples between two classes. High values of F3 indicate the presence of features whose values do not overlap between classes.

F3 = max_{i=1,…,m} [n − overlap(x_{c1}^{f_i}, x_{c2}^{f_i})] / n    (5)
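The feature-overlapping measures above can be sketched for a binary dataset as follows (an illustrative numpy sketch; the function and variable names are ours, F2 is taken as the product over features of the per-feature overlap ratios as in Eq. (4), and overlap widths are clipped at zero when the class ranges do not intersect):

```python
import numpy as np

# X: n x m feature matrix, y: labels in {0, 1}.
def f1(X, y):
    A, B = X[y == 0], X[y == 1]
    num = (A.mean(0) - B.mean(0)) ** 2
    den = A.var(0) + B.var(0)
    return np.max(num / den)                 # best discriminating feature

def f2(X, y):
    A, B = X[y == 0], X[y == 1]
    lo = np.maximum(A.min(0), B.min(0))      # start of overlapping range
    hi = np.minimum(A.max(0), B.max(0))      # end of overlapping range
    width = np.maximum(0.0, hi - lo)         # clip when ranges do not meet
    full = np.maximum(A.max(0), B.max(0)) - np.minimum(A.min(0), B.min(0))
    return np.prod(width / full)             # volume of the overlap region

def f3(X, y):
    A, B = X[y == 0], X[y == 1]
    lo = np.maximum(A.min(0), B.min(0))
    hi = np.minimum(A.max(0), B.max(0))
    inside = (X >= lo) & (X <= hi)           # examples in overlapping range
    overlap = inside.sum(axis=0)
    return np.max((len(X) - overlap) / len(X))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(4, 1, (40, 3))])
y = np.r_[np.zeros(40), np.ones(40)]
```

For this well-separated toy dataset, F1 is high while F2 is close to zero and F3 close to one, as the descriptions above suggest.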
Error rate of the nearest neighbor classifier (N3): Error rate of a 1-nearest neighbor (1-NN) classifier, estimated with the leave-one-out strategy. Low values indicate a high separation of the classes.
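This leave-one-out 1-NN error can be computed directly from the pairwise distance matrix (a minimal sketch; the name `n3` is ours):

```python
import numpy as np

def n3(X, y):
    # Leave-one-out 1-NN error: each example is classified by its
    # nearest neighbor, excluding itself via an infinite diagonal.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nearest = D.argmin(axis=1)
    return np.mean(y[nearest] != y)          # 1-NN error rate

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
y = np.r_[np.zeros(30), np.ones(30)]
```

On this well-separated sample, the returned error rate is close to zero, matching the interpretation that low N3 indicates a high separation of the classes.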
Closeness centrality (Closeness): Mean of the inverse distances between the pairs of vertices of the graph, where d(v_i, v_j) is the shortest distance between vertices i and j:

Closeness = (1/n) Σ_{i≠j} 1 / d(v_i, v_j)

Betweenness centrality (Betweenness): Fraction of shortest paths that pass through each vertex, where δ_{jl}(v_i) is the number of shortest paths between vertices v_j and v_l that pass through v_i and δ_{jl} is the total number of shortest paths between v_j and v_l:

Betweenness = Σ_{i≠j≠l} δ_{jl}(v_i) / δ_{jl}
suitable for our scenario and are not used here. One example is the number of nodes of the graph, which will not vary for a given dataset regardless of its noise level. The only measures in common with those used in [12] are the number of edges, the average clustering coefficient and the average degree. Morais and Prati [12] also do not analyze the expected values of the measures for simpler or more complex datasets. Besides introducing new measures, we also describe the behavior of all measures for simpler or more complex problems. Moreover, we try to identify the measures best suited to detecting the presence of label noise in a dataset.
Clustering coefficient (ClsCoef): Average, over all vertices, of the ratio between the number of edges between a vertex's neighbors and the maximum number of edges that could possibly exist between these neighbors.
ClsCoef will be larger for simpler datasets, which will produce large connected components joining vertices from the same class.
Hub score (Hubs): Scores each node by the number of connections it has to other nodes, weighted by the number of connections these neighbors have. That is, more connected vertices that are also connected to highly connected vertices have a higher hub score.
The hub score is expected to have a low mean for high-complexity datasets, since strong vertices will become less connected to strong neighbors. Hubs are expected at regions of high density of a given class. Therefore, simpler datasets with high density will show larger values for this measure.
Average path length (AvgPath): Average length of all shortest paths in the graph. It measures the efficiency of information spread in the network. It is given by Eq. (10), where n represents the number of vertices of the graph and d(v_i, v_j) is the shortest distance between vertices i and j:

AvgPath = [1/(n(n−1))] Σ_{i≠j} d(v_i, v_j)    (10)
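Several of the structural measures above can be sketched on a graph built from the data (a numpy-only sketch under stated assumptions: the ε threshold is taken here as a percentile of the pairwise distances, and edges are kept only between same-class vertices; the paper computes these measures with the Igraph library):

```python
import numpy as np
from collections import deque

def build_graph(X, y, pct=15):
    # Adjacency: connect points closer than the pct-th percentile of
    # all pairwise distances, keeping only same-class edges (assumption).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    eps = np.percentile(D[np.triu_indices(len(X), 1)], pct)
    A = (D < eps) & (y[:, None] == y[None, :])
    np.fill_diagonal(A, False)
    return A

def clustering_coefficient(A):
    # Mean over vertices of: edges among neighbors / max possible edges.
    coefs = []
    for i in range(len(A)):
        nb = np.flatnonzero(A[i])
        k = len(nb)
        if k < 2:
            coefs.append(0.0)
            continue
        links = A[np.ix_(nb, nb)].sum() / 2
        coefs.append(links / (k * (k - 1) / 2))
    return float(np.mean(coefs))

def avg_path_length(A):
    # Mean shortest-path length over reachable pairs (BFS per vertex).
    n, total, pairs = len(A), 0, 0
    for s in range(n):
        dist = np.full(n, -1)
        dist[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in np.flatnonzero(A[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    q.append(v)
        reach = dist > 0
        total += dist[reach].sum()
        pairs += reach.sum()
    return total / pairs if pairs else 0.0

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
y = np.r_[np.zeros(30), np.ones(30)]
A = build_graph(X, y)
edges = A.sum() // 2
density = 2 * edges / (len(A) * (len(A) - 1))
```

Edges, Degree (row sums of A), Density, MaxComp (largest BFS component), ClsCoef and AvgPath all derive from this adjacency matrix; hub scores would additionally weight each vertex by its neighbors' connectivity.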
Table 1
Minimum and maximum values of each measure.

Measure        Minimum value   Maximum value

Feature overlapping
F1             0               1
F2             0               1
F3             0               1

Class separability
L1             0               1
L2             0               1
N1             0               1
N2             0               1
N3             0               1

Data topology
L3             0               1
N4             0               1
T1             0               1

Structural representation
Edges          0               n(n−1)/2
Degree         0               n−1
MaxComp        1               n
Closeness      0               1/(n−1)
Betweenness    0               (n−1)(n−2)/2
Hubs           0               1
Density        0               1
ClsCoef        0               1
AvgPath        1/(n(n−1))      0.5
4. Experiments
This section presents the experiments performed to evaluate how the different data complexity measures from Section 3 behave in the presence of label noise, for several public benchmark datasets. First, a set of classification benchmark datasets was chosen for the experiments. Different levels of label noise were afterwards added to each dataset. We then monitored how the complexity level of the datasets is affected by the noise imputation. This is accomplished by:
1. Verifying the Spearman correlation between the measure values and the artificially imputed noise rates. This analysis allows identifying a set of measures more sensitive to the presence of noise in a dataset.
2. Evaluating the correlation between the measure values, to identify those measures that (1) capture different concepts regarding noisy environments and (2) can be jointly used to support the development of new noise-handling techniques.
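The first analysis step can be sketched as a Spearman rank correlation computed with numpy only (illustrative; ties receive average ranks, and the synthetic `measure` and `noise_rates` columns below are assumptions standing in for the actual meta-level data):

```python
import numpy as np

def rankdata(x):
    # Average ranks: tied values share the mean of their positions.
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    sx = np.asarray(x)[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman rho = Pearson correlation of the ranks.
    ra, rb = rankdata(a), rankdata(b)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical column: one measure's values for versions of a dataset
# with increasing imputed noise rates.
noise_rates = np.array([0, 5, 10, 20, 40] * 4, dtype=float)
measure = 0.01 * noise_rates + np.random.default_rng(4).normal(0, 0.02, 20)
rho = spearman(measure, noise_rates)
```

A measure whose values grow (or shrink) monotonically with the imputed noise rate yields a rho close to +1 (or −1), which is the sensitivity criterion used in the analysis.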
Next, we detail the experimental protocol previously outlined.
4.1. Datasets
For the experiments, we selected artificial and real datasets. The artificial datasets were introduced and generously provided by Amancio et al. [29]. These authors generated artificial classification datasets based on multivariate Gaussians, with different levels of overlapping between the classes. For this paper we selected 180 balanced datasets (with the same number of examples per class) with 2 classes, containing 2, 10 and 50 features and with different overlapping rates for each number of features. We chose such datasets based on the observations of a recent work [9], which points out that class overlap seems to be a principal contributor to instance hardness and that noisy data can ultimately be considered hard instances.
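The label noise imputation applied to these datasets can be sketched as a controlled flip of a random fraction of the labels to a different class (a minimal sketch; the paper's exact imputation procedure may differ):

```python
import numpy as np

def add_label_noise(y, rate, rng):
    # Flip the labels of a random fraction `rate` of the examples,
    # always to a class different from the original one.
    y = y.copy()
    classes = np.unique(y)
    n_noisy = int(round(rate * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    for i in idx:
        y[i] = rng.choice(classes[classes != y[i]])
    return y

rng = np.random.default_rng(5)
y = np.repeat([0, 1, 2], 50)             # 150 examples, 3 classes
noisy = add_label_noise(y, 0.20, rng)
changed = np.mean(noisy != y)            # fraction of flipped labels
```

Repeating this for each noise level and for several random seeds produces the noisy dataset versions monitored in the experiments.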
Table 2
Summary of the datasets' characteristics: name, number of examples, number of features, number of classes and percentage of examples in the majority class (%MC) of each dataset.

Dataset                        Examples  Features  Classes  %MC
abalone                        4153      9         19       16
acute-inflammations            120       8         2        58
appendicitis                   106       8         2        80
australian                     690       15        2        55
backache                       180       32        2        86
balance                        625       5         3        46
banana                         5300      3         2        55
blood-transfusion-service      748       5         2        76
breast-tissue                  106       10        6        20
bupa                           345       7         2        57
car                            1728      7         4        70
cardiotocography               2126      21        10       27
cmc                            1473      10        3        42
collins                        485       22        13       16
connectionist-mines-vs-rocks   208       61        2        53
crabs                          200       6         2        50
expgen                         207       80        5        57
flags                          178       29        5        33
flare                          1066      12        6        31
glass                          205       10        5        37
habermans-survival             306       4         2        73
hayes-roth                     160       5         3        40
ionosphere                     351       34        2        64
iris                           150       5         3        33
kr-vs-kp                       3196      37        2        52
led7digit                      500       8         10       11
leukemia-haslinger             100       51        2        51
molecular-promoters            106       58        2        50
monk1                          556       7         2        50
monk2                          601       7         2        65
movement-libras                360       91        15       6
newthyroid                     215       6         3        69
page-blocks                    5473      11        5        89
parkinsons                     195       23        2        75
phoneme                        5404      6         2        70
pima                           768       9         2        65
ringnorm                       7400      21        2        50
saheart                        462       10        2        65
segmentation                   2310      19        7        14
spectf                         349       45        2        72
statlog-german                 1000      21        2        70
statlog-heart                  270       14        2        55
tae                            151       6         3        34
tic-tac-toe                    958       10        2        65
titanic                        2201      4         2        67
vehicle                        846       19        4        25
vowel                          990       11        11       9
waveform-5000                  5000      41        3        33
wdbc                           569       31        2        62
wine                           178       14        3        39
wine-quality                   6497      12        6        43
yeast                          1484      9         9        31
zoo                            84        17        4        48
distance-based measures employed the normalized Euclidean distance for continuous features and the overlap distance for nominal features (this distance is 0 for equal categorical values and 1 otherwise) [34]. To build the graph for the graph-based measures, we used the ε-NN algorithm, with the threshold value ε equal to 15%, as in [12]. The measures described in Section 3.4 were calculated using the Igraph library [35].
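This graph construction can be sketched as follows (assumptions: the ε threshold is read here as the 15th percentile of the pairwise distances, and the heterogeneous distance sums squared range-normalized differences for continuous features with the overlap distance for nominal ones before taking the square root):

```python
import numpy as np

def pairwise_hetero(Xc, Xn):
    # Range-normalized Euclidean terms on continuous features (Xc),
    # overlap distance (0 if equal, 1 otherwise) on nominal ones (Xn).
    rng_ = Xc.max(axis=0) - Xc.min(axis=0)
    rng_[rng_ == 0] = 1.0                 # guard against constant features
    n = len(Xc)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            cont = (((Xc[i] - Xc[j]) / rng_) ** 2).sum()
            nom = float((Xn[i] != Xn[j]).sum())
            D[i, j] = D[j, i] = np.sqrt(cont + nom)
    return D

Xc = np.array([[1.0], [1.2], [5.0], [5.1]])   # one continuous feature
Xn = np.array([["a"], ["a"], ["b"], ["b"]])   # one nominal feature
D = pairwise_hetero(Xc, Xn)
eps = np.percentile(D[np.triu_indices(len(D), 1)], 15)
A = D < eps                                    # epsilon-NN adjacency
np.fill_diagonal(A, False)
edges = A.sum() // 2
```

Only the tightest pairs fall under the 15th-percentile threshold, so the resulting graph connects close, same-looking examples while leaving distant pairs unlinked.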
As a result, we have one meta-level dataset that will be employed in the subsequent experiments. It contains 20 meta-features (11 complexity and 9 graph-based measures) describing the benchmark datasets and their noisy versions. This meta-level dataset therefore has 2173 examples: 53 (original datasets) + 53 (datasets) × 4 (noise levels) × 10 (random versions). Two types of analysis were performed using the meta-level dataset:
1. Correlation between the measure values and the noise level of the datasets;
2. Correlation within the measure values.
Fig. 1. Distribution of the normalized values of each measure (F1, F2, F3, L1, L2, L3, N1, N2, N3, N4, T1, Edges, Degree, Density, MaxComp, Closeness, Betweenness, Hub, ClsCoef and AvgPath) for noise rates of 0%, 5%, 10%, 20% and 40%.
capture: class separability (N3, N1, N2, L2 and L1), data topology according to a nearest neighbor classifier (N4, T1 and L3) and individual feature overlapping (F2). Regarding the measures indirectly related to the noise levels, two are basic complexity measures based on feature overlapping (F1 and F3), while six are based on the structural representation (Density, Hub, Degree, ClsCoef, Edges and MaxComp). Only the Closeness and Betweenness measures did not present significant correlation with the noise levels. As expected, the most prominent measures are the same that showed the most distinct values for different noise levels in the histograms of Fig. 1.
Despite the statistical significance, it is possible to notice some low correlation values in Fig. 2. Only the measures N3, N1 and N2 presented correlation values higher than 0.5. These correlations were higher in the experiments with artificial datasets. This can be a result of the fact that, for the real datasets, the amount of noise added is potential rather than actual.
Fig. 2. Correlation between the values of each measure (N3, N1, N2, N4, L2, L1, T1, F2, L3, AvgPath, Betweenness, Closeness, MaxComp, Edges, ClsCoef, Degree, Hub, F3, Density and F1) and the imputed noise levels.
Fig. 3. Heatmap of correlation between measures.
Fig. 4. F-score of the filters (AENN, ENN, RENN and GraphNN) for each dataset.
Fig. 5. Frequency of wins of each filter (AENN, ENN, GraphNN and RENN).

Acknowledgments
The authors would like to thank FAPESP, CNPq and CAPES for their financial support.
7. Conclusion

This work investigated how label noise affects the complexity of classification tasks, by monitoring the values of simple measures extracted from datasets with increasing noise levels. Part of these measures were already used in the literature for understanding and analyzing the complexity of classification tasks. Other measures, based on modeling the datasets by graphs, were introduced in this paper.
Experimentally, the measures able to capture characteristics like the separability of the classes, alterations in the class boundary and densities within the classes were those most affected by the introduction of label noise in the data. Therefore, they are good candidates for further exploitation and for supporting the design of new noise identification techniques and noise-tolerant classification algorithms. Moreover, the experimental results showed a low correlation between the basic complexity measures and the graph-based measures, stressing the relevance of exploring different views and representations of the data structure. Using two of the most prominent measures from these analyses, we devised a new noise filtering technique, which has shown promising results in noise identification.
The graph-based measures Edges, Degree and Density were highlighted in all the analyses carried out. This may have occurred because, when label noise is introduced, examples from distinct classes become closer to each other and are not connected in the graph. Among the standard data complexity measures, those that rely on nearest neighbor information, such as N1 and N3, were also able to better capture the effects of noise imputation. This is also due to the fact that label noise tends to affect the spatial proximity of data from different classes. Thus, the idea that data from the same class tend to be close to each other in the feature space, while far from examples of different classes, is reinforced.
Our experimental protocol and graph-based measures can also be used in other types of analysis, such as in verifying the effects of data unbalance, feature selection and feature discretization, among others.

References
[1] J.R. Quinlan, The effect of noise on concept learning, in: R.S. Michalski, J.G. Carbonell, T.M. Mitchell (Eds.), Machine Learning, Morgan Kaufmann Publishers Inc., 1986, pp. 149–166.
[2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Knowledge discovery and data mining: towards a unifying framework, in: E. Simoudis, J. Han, U.M. Fayyad (Eds.), KDD, AAAI Press, 1996, pp. 82–88.
[3] B. Sluban, D. Gamberger, N. Lavrac, Ensemble-based noise detection: noise ranking and visual performance evaluation, Data Mining Knowl. Discov. 28 (2) (2014) 265–303. http://dx.doi.org/10.1007/s10618-012-0299-1.
[4] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Trans. Syst. Man Cybern. 6 (6) (1976) 448–452. http://dx.doi.org/10.1109/TSMC.1976.4309523.
[5] C.E. Brodley, M.A. Friedl, Identifying and eliminating mislabeled training instances, in: W.J. Clancey, D.S. Weld (Eds.), AAAI/IAAI, vol. 1, AAAI Press/The MIT Press, 1996, pp. 799–805.
[6] S. Verbaeten, A.V. Assche, Ensemble methods for noise elimination in classification problems, in: T. Windeatt, F. Roli (Eds.), Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 2709, Springer, Berlin, Heidelberg, 2003, pp. 317–325. http://dx.doi.org/10.1007/3-540-44938-8_32.
[7] B. Sluban, D. Gamberger, N. Lavrac, Advances in class noise detection, in: H. Coelho, R. Studer, M. Wooldridge (Eds.), ECAI, Frontiers in Artificial Intelligence and Applications, vol. 215, IOS Press, 2010, pp. 1105–1106.
[8] L.P.F. Garcia, A.C. Lorena, A.C.P.L.F. Carvalho, A study on class noise detection and elimination, in: A.C. Lorena, C.E. Thomaz, A.T.R. Pozo (Eds.), 2012 Brazilian Symposium on Neural Networks (SBRN), IEEE, 2012, pp. 13–18. http://dx.doi.org/10.1109/SBRN.2012.49.
[9] M. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Mach. Learn. 95 (2) (2014) 225–256. http://dx.doi.org/10.1007/s10994-013-5422-z.
[10] B. Frenay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Trans. Neural Netw. Learning Syst. 99 (2015) 1–25. http://dx.doi.org/10.1109/TNNLS.2013.2292894.
[11] T.K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 289–300. http://dx.doi.org/10.1109/34.990132.
[12] G. Morais, R.C. Prati, Complex network measures for data set characterization, in: 2013 Brazilian Conference on Intelligent Systems (BRACIS), 2013, pp. 12–18. http://dx.doi.org/10.1109/BRACIS.2013.11.
[13] L.F. Costa, F.A. Rodrigues, G. Travieso, P.R.V. Boas, Characterization of complex networks: a survey of measurements, Adv. Phys. 56 (2008) 167–242.
[14] E. Kolaczyk, Statistical Analysis of Network Data: Methods and Models, Springer Series in Statistics, Springer, 2009.
[15] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study, Artif. Intell. Rev. 22 (3) (2004) 177–210. http://dx.doi.org/10.1007/s10462-004-0751-8.
[16] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106. http://dx.doi.org/10.1007/BF00116251.
[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., 1995.
[18] E. Eskin, Detecting errors within a corpus using anomaly detection, in: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, Association for Computational Linguistics, 2000, pp. 148–153.
[19] A. Ganapathiraju, J. Picone, Support vector machines for automatic data cleanup, in: INTERSPEECH, ISCA, 2000, pp. 210–213.
[20] L. Li, Y.S. Abu-Mostafa, Data Complexity in Machine Learning, Technical Report CaltechCSTR:2006.004, Caltech Computer Science, 2006.
[21] T.K. Ho, Data complexity analysis: linkage between context and solution in classification, in: Structural, Syntactic, and Statistical Pattern Recognition, vol.
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
Luís P.F. Garcia received the B.Sc. degree in Computer Engineering in 2010, from the University of São Paulo, Brazil. He is currently a Ph.D. candidate at the Institute of Computer Sciences and Mathematics, University of São Paulo, Brazil. His current research interests include Machine Learning, Data Mining, Meta-learning and Data Preprocessing.