
Adaptive Learning-Based k-Nearest Neighbor Classifiers With Resilience to Class Imbalance

Sankha Subhra Mullick, Shounak Datta, and Swagatam Das, Senior Member, IEEE

Abstract— The classification accuracy of a k-nearest neighbor (kNN) classifier is largely dependent on the choice of the number of nearest neighbors, denoted by k. However, given a data set, it is a tedious task to optimize the performance of kNN by tuning k. Moreover, the performance of kNN degrades in the presence of class imbalance, a situation characterized by disparate representation from different classes. We aim to address both issues in this paper and propose a variant of kNN called the Adaptive kNN (Ada-kNN). The Ada-kNN classifier uses the density and distribution of the neighborhood of a test point and learns a suitable point-specific k for it with the help of artificial neural networks. We further improve our proposal by replacing the neural network with a heuristic learning method guided by an indicator of the local density of a test point and using information about its neighboring training points. The proposed heuristic learning algorithm preserves the simplicity of kNN without incurring serious computational burden. We call this method Ada-kNN2. Ada-kNN and Ada-kNN2 perform very competitively when compared with kNN, five of kNN's state-of-the-art variants, and other popular classifiers. Furthermore, we propose a class-based global weighting scheme (Global Imbalance Handling Scheme or GIHS) to compensate for the effect of class imbalance. We perform extensive experiments on a wide variety of data sets to establish the improvement shown by Ada-kNN and Ada-kNN2 using the proposed GIHS, when compared with kNN and its 12 variants specifically tailored for imbalanced classification.

Index Terms— Heuristic learning, imbalanced classification, k-nearest neighbor (kNN), parameter adaptation, supervised learning.

Manuscript received December 28, 2016; revised July 11, 2017 and November 29, 2017; accepted February 27, 2018. (Corresponding author: Swagatam Das.) The authors are with the Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: sankha_r@isical.ac.in; shounak.jaduniv@gmail.com; swagatam.das@isical.ac.in). This paper has supplementary downloadable material for online publication only, as provided by the authors. This includes the detailed proof of some theorems stated in the main paper, details of the parameter tuning, experimental results on individual data sets, and an explanation of biases of some of the indices used. This material is 260 KB in size. Digital Object Identifier 10.1109/TNNLS.2018.2812279

I. INTRODUCTION

A. Overview

CLASSIFICATION can be posed as the task of predicting a many-to-one mapping g(.) from a set X of D-dimensional data points (thus X ⊂ R^D, assuming that the categorical features are replaced by suitable real values) to a set of class labels C = {1, 2, . . . , C}. A classifier is designed for the purpose of estimating the properties of the mapping g : X → C. First, in the training phase, the classifier is fed with a training set P (P ⊆ X and |P| = n) to learn about the characteristics of g(.). In this stage, for all data points xi ∈ P, where i = 1, 2, . . . , n, the value of g(xi) is available to the classifier. Once trained, the classifier is expected to correctly predict the value of g(yi) for a new data point yi ∈ Q (Q ⊂ X, |Q| = m, and i = 1, 2, . . . , m). This is called the testing phase. The data point yi is known as a test or query point, while the set Q of all such points is called a test set.

The k-Nearest Neighbor (kNN) classifier has always been preferred for its methodical simplicity, nonparametric working principle [1], and ease of implementation. The kNN classifier involves tuning of a single parameter k (the number of nearest neighbors to be considered). However, it is not easy to find the value of k for which the algorithm performs optimally on a wide range of data sets (or for all the points in the same data set). Theoretical studies suggest that the number of points in the training set (say n) and the value of k (1 ≤ k ≤ n) both control the performance of the kNN algorithm [2]. Furthermore, if k = 1, then the probability of misclassification is bounded above by twice the risk of the Bayes decision rule as n → ∞ [2]. However, depending on the data set, choices other than k = 1 may be more suitable [3]. Therefore, the theory discussed in [2] and [4] does not help with the choice of k in practical cases. Usually, a global k is chosen, i.e., a single value of k for classifying all test points. Conventional choices of such a global k value are 1, 3, 5, 7, and 9 [5], [6], but k may also be as large as √n [1], [7]. To optimize the performance, kNN is commonly run with a number of different k values. Subsequently, several techniques, such as cross validation and probabilistic estimation [8], [9], may be used to choose the best k among the tested values. While probabilistic modeling-based algorithms are hard to implement and usually depend on prior assumptions about the data set, cross validation is computationally rather expensive. Moreover, as the distribution of the classes is not known a priori, any choice of a global k stands the risk of ignoring the local distribution of the neighborhood of a test point, whereas consideration of the unique features of the locality of a test point should decrease the chance of misclassification of that point. In this paper, these facts encouraged us to choose a data-point-specific k value using an indicator of the local density and class distributions of its neighborhood.

Besides the difficulty with the selection of k, the kNN classification rule also faces a challenge on data sets with class imbalance, i.e., when all the classes do not have a comparable number of representatives [10]. Hence, we also introduce a class-specific global weighting scheme to tackle the issue of class imbalance.

B. Motivation

The conventional and widely accepted practice of using a global k value completely ignores the local data density and class distribution (also referred to as local information in the subsequent text) of a test point's neighborhood. The majority of practical data sets have a global density and distribution of classes differing widely from those in the neighborhood of a test point. In such data sets, the information of the locality around a test point often proves to be vital for achieving the right classification. In the case of the kNN classifier, the locality information around a test point should be considered while choosing a suitable k value for it. This would result in a data-point-specific choice of k. Since data points belonging to the same locality are likely to be correctly classified by similar values of k, the optimal k value can be modeled as a suitable nonlinear function of the training points (see Section III-D) called the k-terrain. Hence, the proposed Ada-kNN classifier, which learns the k-terrain using a supervised learner, is likely to perform better than the traditional kNN and its variants that use a global k value.

It is worth noting that Ada-kNN is reliant on the efficacy of an additional learning algorithm, which further adds to the computational complexity. As knowledge about the entire nonlinear function of k is not an absolute necessity, an approximation of the function in the vicinity of a test point is likely to suffice. Thus, the number of neighbors which should influence the k value of a query point can also be estimated based on an indication of the local data density. One may then use the correctly classifying k information of these neighbors (reflective of the local information) to estimate the optimal k value for the query point in question. This idea motivated us to replace the supervised learner in Ada-kNN with such a heuristic learning scheme, giving rise to the Ada-kNN2 classifier, which is also presented in this paper.

Achieving good classification for imbalanced data sets is a challenge for the kNN classifier, as it relies on the expectation that in the neighborhood of a test point, the number of training points from its ideal class will be high. In reality, this may not hold true for a test point belonging to the minority class of an imbalanced data set, as most of its neighbors may still be from one of the majority classes in spite of an increase in the number of minority neighbors (due to the global scarcity of the minority points). In such a scenario, the basic kNN classifier, which does not consider the class imbalance, will wrongly assign the test point to a majority class. The class-specific weighted kNN classifier can prove to be efficient in such situations. However, while choosing the weights, the local information should also be considered. In Ada-kNN and Ada-kNN2, since the local information is accounted for by the point-specific value of k, a global class-specific weighting scheme may suffice. Therefore, we present a scheme to combine class-specific global weights with the proposed classifiers to handle class imbalance.

C. Contribution

The main contributions of this paper are as follows. To facilitate the choice of a data-point-specific k, the proposed Ada-kNN method¹ first finds, for each of the training points, a value of k for which kNN can accurately classify that point. Assuming this k value to approximate the local information of the neighborhood of a data point, our algorithm tries to estimate a suitable k value for each query point. This gives rise to a nonlinear regression problem, which can be solved by a feedforward Multi-Layer Perceptron (MLP) using the Scaled Conjugate Gradient (SCG) learning algorithm [11] to estimate the k-terrain.

We further propose a new learning approach that directly utilizes the correctly classifying k values of the neighbors to estimate a suitable k value for each query point. We use the nearest neighbor distance as an indicator of the local density for a query point and to proportionally vary the size of its neighborhood. The resultant algorithm is called Ada-kNN2¹. Both the proposed methods are compared with kNN, Linear Discriminant Analysis (LDA) [12], MLP using Back-Propagation [13] (MLP-BP), MLP using SCG [11] (MLP-SCG), Decision Tree (DT) [14], AkNN [15], Extended Nearest Neighbor (ENN) [16], dyn-kNN [17], Fuzzy Analogy-Based Classification (FABC) [18], and k∗ Nearest Neighbor (k∗NN) [19] over 17 data sets (over six data sets for k∗NN and eight data sets for FABC) of varying properties. Both the proposed methods are observed to perform competitively, with Ada-kNN performing equivalent to the best on most small- and medium-scale data sets, while Ada-kNN2 generally achieves the best result among all the contenders.

The proposed classifiers can tackle the local variations of a data set with the varying value of k, thus dispensing with the need for local weighting schemes on imbalanced data sets. Therefore, we define a simple Global Imbalance Handling Scheme (GIHS) and combine it with the kNN, Ada-kNN, and Ada-kNN2 classifiers¹ to observe the performance improvement for imbalanced data sets. We perform extensive experiments using 22 two-class data sets and 11 multi-class data sets having varying degrees of imbalance and compare the performance of the proposed algorithms against that of the state-of-the-art variants of kNN for data imbalance [namely kNN, k Rare class Nearest Neighbor (kRNN) [20], k Exemplar-based Nearest Neighbor (kENN) [21], k Positive-biased Nearest Neighbor (kPNN) [22], Class Conditional Nearest Neighbor Distribution (CCNND) [23], dyn-kNN [17], Neighbor-Weighted kNN (NWkNN) [24], and WkNN [25]], as well as Ada-kNN and Ada-kNN2 coupled with under-sampling [26] and over-sampling [27] techniques. The proposed methods achieve competitive performance compared with the competing algorithms, with Ada-kNN2 coupled with GIHS being the best performer in most cases.

¹Codes are available from: https://github.com/SankhaSubhra/Ada-kNN.

D. Organization

We arrange the rest of this paper, subsequent to the introduction in Section I, in the following manner. In Section II, we present a brief but comprehensive review of the notable works done in the field. In Section III, we describe Ada-kNN and Ada-kNN2, followed by an explanation of GIHS for handling imbalance. In Section IV, we first discuss the choice

of the data sets and introduce the indices for quantifying the performance of the classification methods. The rest of Section IV contains experimental results and a comparative discussion of the proposed methods along with their contenders. We present our conclusion and the future directions of research in Section V.

II. BACKGROUND

In this section, we briefly describe the notable past works attempting to aid the kNN classifier by incorporating information about the neighborhood of a test point and making it suitable for handling class imbalance.

A. Sensitivity to Locality Information of Nearest Neighbor Classifiers

In the kNN classifier, the local density around a test point should ideally affect the choice of k (see [19, Fig. 1]), resulting in a better strategy for rightly classifying the test point. This idea is employed by Wettschereck and Dietterich [28] in four of their proposed methods to determine test-point-specific values of k. The first two methods utilize the information of the correctly classifying k values of the neighbors of a test point. The third technique works by finding the class-specific k values, which successfully classify a maximum number of training members from their corresponding classes. In the fourth method, the test points are clustered, and each of the clusters uses a single k value for classifying all test points belonging to that cluster.

Ougiaroglou et al. [29] proposed a dynamic kNN classifier by modifying the incremental nearest neighbor searching algorithm [30]. Their method introduced three heuristics to reduce the computational complexity of the incremental kNN search. Wang et al. [15] incorporated an adaptive distance measure with kNN, forming an algorithm called AkNN. The method finds, for each training point xi ∈ P, the distance r_xi of the nearest neighbor belonging to any other class. While finding the neighbors for a query point yi ∈ Q, the distance between yi and xi is downscaled by the corresponding r_xi.

Bhattacharya et al. [31] devised an algorithm for assigning a test-point-specific value of k. The method is based on neighborhood density [see [31, eq. (1)]] and certainty factor [see [31, eqs. (5) and (6)]]. Tang and He [16] proposed a variant of the classical kNN decision rule known as the ENN method. The key feature of their algorithm is the consideration of not only the k neighboring training points of the test point in question, but also those training points which include the query point in their k neighborhoods. The authors iteratively assign a test point to each of the possible classes and calculate a class-specific statistic [see [16, eqs. (4) and (5)]]. The test point is then assigned to the class having the highest value of the statistic.

Another work by Garcia-Pedrajas et al. [17] (hereafter called dyn-kNN) approached the task of a test-point-specific k value by considering both the global and local implications of each choice of k. The algorithm evaluates the global performance for all k ∈ [kmin, kmax]. A training-point-specific local performance for each of the k values is also maintained. The algorithm then assigns to a training point xi the value of k which performs the best (in terms of global and local performance) over those training points of which xi is a neighbor. For a test point, however, the k value corresponding to its nearest neighbor is used. Thus, the method uses disparate philosophies for assigning k values to the training and test points. Furthermore, the algorithm suffers from a high time complexity.

Anava and Levy [19] proposed k∗NN to estimate the class label of a test point yi ∈ Q as a weighted average of the class labels of training points. They showed that for yi, only a test-point-dependent finite number of nearest neighbors will contribute to the calculation of its class label. The problem of finding the point-specific number of neighbors and their corresponding weights can be reduced to a convex optimization problem, which can be solved exactly by the greedy method proposed by the authors.

Ezghari et al. [18] introduced a new adaptive variant of fuzzy kNN called FABC, which first transforms the data points from their original numerical space to a fuzzy membership space. A query point is classified by aggregating its similarities with each of the training points and their respective fuzzy membership values for the different classes.

B. kNN Classifiers for Handling Class Imbalance

The classification accuracy in the presence of class imbalance can be improved in one of two directions. In the first direction, the original training set is kept unaltered, while the classifier is modified to incorporate immunity against the effect of imbalance. Along the second line of thinking, the focus is shifted from the classifier to the training set itself, which is properly pre-processed by either under-sampling (the majority class) or over-sampling (the minority class) [10], [32]–[34].

1) kNN Variants Designed Specifically for the Class Imbalance Problem: As a primary approach, Zhang and Mani [35] used five different methods to under-sample the majority class such that the effect of class imbalance can be alleviated.

Tan [24] proposed a weight selection scheme for the weighted kNN method (called NWkNN) to handle imbalance. In this technique, each class is assigned a weight depending on its global probability and that of the smallest class. However, the algorithm introduces an additional data-dependent parameter, which needs to be optimized.

Wang et al. [36] modified the extended kNN proposed by Wang and Bell [37] to make it suitable for handling imbalance. The authors defined class-specific weights in the form of differences between the local (calculated over the k neighborhood) and global (calculated over P) probabilities, respectively, for each of the classes.

Liu and Chawla [38] developed a weight selection technique for handling class imbalance. They reduced the problem of kNN to finding the maximum among the probabilities of a test point yi belonging to each of the classes, given the k neighborhood of yi. To achieve this, the authors used the posterior probabilities of the classes as weights (estimated by mixture models or a Bayesian network). Li and Zhang [21] presented an idea (called kENN) of extending some of the

minority class members to Gaussian balls, to compensate for the disadvantage of scarcity. Such a strategy effectively reduces the distance between a test point and the minority class members so that the minority class gets an edge.

Dubey and Pudi [25] utilized a weighting scheme to aid the issue of rare class classification (hereafter called WkNN). Their method incorporates a query-specific local weighting alongside a set of global class-specific weights. However, the method needs to employ additional kNN runs over the neighborhood of each query point. Kriminger et al. [23] used the nearest neighbor information to estimate the probability of a class in their algorithm called CCNND. For each class, the algorithm finds the probability of a test point belonging to that class given the distance of its nearest neighbor from that class (using the empirical cumulative distribution function of the intraclass nearest neighbor distances). Zhang and Li [22] proposed a variant of kNN called kPNN. The idea uses a dynamic neighborhood (containing at least k minority class members) and classifies based on the estimation of the locally adjusted posterior probability of the minority class.

Recently, Zhang et al. [20] proposed kRNN, which uses a further constrained definition of the dynamic neighborhood proposed in [22] for each test point. Furthermore, the classifier estimates both the global and local adjusted posterior probabilities for the minority class.

However, the problems with all these methods are the involvement of exhaustive search, the introduction of new parameters, and significant computational overhead, which hinder the scalability and easy implementation of kNN.

2) Removing Imbalance by Pre-Processing the Data Set: A data set can be pre-processed by under-sampling the points of the majority classes or over-sampling the points of the minority classes such that the number of representatives from all the classes becomes comparable. A common method of under-sampling is called Random Under-Sampling (RUS) [26], which, as the name suggests, randomly discards representatives of the majority class from the training set until the effect of imbalance is sufficiently mitigated, whereas a notable approach for over-sampling known as the Synthetic Minority Over-Sampling Technique (SMOTE) [27] randomly creates new minority class points by interpolating between the existing minority points and their neighbors. However, the original training set characteristics are not entirely preserved through any of these techniques. In addition, both methods introduce new parameters, which need to be properly tuned.

III. PROPOSED METHODS

Starting with some basic notations which we follow throughout this paper, we first describe the canonical and weighted kNN classification algorithms. Next, we propose the Ada-kNN classifier and extend it to Ada-kNN2 using a heuristic learning algorithm for point-specific k selection. Thereafter, we present GIHS and use it to augment Ada-kNN and Ada-kNN2 for imbalanced classification problems. Finally, we finish the section with the computational complexity analysis of the proposed algorithms.

A. kNN and the Class-Specific Weighted kNN Classifiers

Let the class label of a training point xi ∈ P be denoted as c(xi) ∈ C, for i = 1, 2, . . . , n. The set of all training points belonging to class c is denoted by Cc, and |Cc| = nc. The k nearest neighbors of a data point x ∈ X, from a set S, form the set S_k(x). Also, let us take I(.) to be an indicator function defined as in (1), where cond is some condition to be satisfied

I(cond) = 1, if cond is true; 0, if cond is false.    (1)

We can now describe the basic kNN classification rule with the help of these notations. For a test point yi, if the kNNs of yi in the training set P form the set P_k(yi), then the predicted class label of yi, denoted as ĉ(yi), will be

ĉ(yi) = arg max_{c ∈ {1,2,...,C}} Σ_{xi ∈ P_k(yi)} I(c(xi) = c).    (2)

As indicated by (2), the kNN classifier labels a new point yi as belonging to the class having the maximum number of members in the set P_k(yi). In the class-specific weighted kNN classifier [24], every class is associated with a weight, and the number of points belonging to a certain class is multiplied by the weight associated with that class. The rest is similar to the conventional kNN classification, resulting in the assignment of the new point to the class producing the maximum weighted number of members in P_k(yi). If the weight for the c-th class is denoted by wc, then the weighted kNN classification rule can be expressed as

ĉ(yi) = arg max_{c ∈ {1,2,...,C}} Σ_{xi ∈ P_k(yi)} wc I(c(xi) = c).    (3)

The weights in (3) can have any real value greater than 0. The weighted kNN classifier is actually a more general form of the kNN classifier, which can be reduced to the basic kNN classifier by making all the weights equal. In weighted kNN classification, the user enjoys the flexibility to introduce problem-specific class priorities, which can achieve better results.
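To make the decision rules (2) and (3) concrete, the following is a minimal Python sketch of the weighted kNN rule. It is not the authors' implementation; the function name weighted_knn_predict, the brute-force distance computation, and the use of NumPy are our own illustrative assumptions. Setting all weights to 1 recovers the basic rule (2).

```python
import numpy as np

def weighted_knn_predict(P, labels, weights, y, k):
    """Sketch of the class-specific weighted kNN rule (3).

    P       : (n, D) array of training points.
    labels  : length-n array of class labels in {1, ..., C}.
    weights : dict mapping class label c -> weight w_c (> 0).
    y       : (D,) query point.
    k       : number of nearest neighbors to use.
    """
    # Euclidean distances from the query to every training point.
    dists = np.linalg.norm(P - y, axis=1)
    # Indices of the k nearest neighbors (the set P_k(y)).
    nn_idx = np.argsort(dists)[:k]
    # Weighted vote: each neighbor adds w_c to the score of its class c.
    scores = {}
    for i in nn_idx:
        c = labels[i]
        scores[c] = scores.get(c, 0.0) + weights[c]
    # Predicted label = class with the maximum weighted count.
    return max(scores, key=scores.get)

# Example: with all weights equal to 1, the rule reduces to plain kNN.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(20, 2))
    labels = np.array([1] * 10 + [2] * 10)
    w = {1: 1.0, 2: 1.0}
    print(weighted_knn_predict(P, labels, w, np.array([0.1, 0.2]), k=5))
```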

B. Description of Ada-kNN

The Ada-kNN classifier utilizes the information of the neighborhood of a test point and mitigates the problem of a global choice of k by introducing the concept of a point-specific k value. One way to do this is by finding the sets of k values correctly classifying each of the training points xi ∈ P and employing a supervised learner to model the k values as a function of their corresponding training vectors. This function can then be applied to a test point to identify a k value suitable for it.

The Ada-kNN algorithm starts by finding the set of successful k values, K_xi, for each of the training points xi ∈ P. One may perform an exhaustive search over the set of all possible k values, i.e., K = {1, 2, . . . , kmax}, to find the set K_xi ∀xi ∈ P. Here, kmax = ⌈√n⌉, which satisfies the following desirable properties.
1) kmax should neither be too large nor too small, to preserve the local structure of the neighborhood.
2) kmax should be data-dependent.

However, exhaustively searching over all k ∈ K is a computationally expensive process, especially when the training set contains a large number of instances. Furthermore, the set K_xi for each xi ∈ P generally consists of subsets of consecutive k values for which the point xi is correctly classified. These observations lead us to perform the search over a random permutation of K, denoted by Krand, instead of K. Such a randomized search will reduce the chance of testing consecutive values of k. Moreover, such a strategy will pick any k in the range [1, kmax] with equal probability. This ensures a better cover of K in a smaller average number of trials, increasing the chance of finding a correctly classifying k value for a training point.

After finding K_xi ∀xi ∈ P, we may feed that information to an MLP and proceed to estimate the nonlinear k-terrain for a given data set. The learning phase of the MLP requires a single k_xi for each training point xi ∈ P, and thus, finding the entire set K_xi is not an absolute necessity. However, such a choice of k_xi may be susceptible to the noise present in the data set. Hence, we recommend trying α trials using the first α elements of Krand and recording all the correctly classifying k values in a set K̄_xi (K̄_xi ⊆ K_xi). Only when all the α trials are a failure do we look for the first correctly classifying value of k among the subsequent members of Krand. At the time of training the MLP, one can randomly choose a value of k_xi ∈ K̄_xi.

In the test phase of Ada-kNN, the trained MLP can be fed with a query point yi ∈ Q to obtain the value of k_yi.
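The following Python sketch illustrates this training/testing pipeline under our own assumptions: the helper successful_k_values mirrors the randomized α-trial search described above, a leave-one-out kNN vote decides whether a given k classifies a training point correctly, and scikit-learn's MLPClassifier stands in for the MATLAB MLP trained with SCG used in the paper. None of these names come from the original implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def knn_correct(P, labels, i, k):
    """True if training point i is correctly classified by kNN with the
    given k, using the remaining training points P \\ {x_i}.
    P is an (n, D) NumPy array and labels a length-n NumPy array."""
    dists = np.linalg.norm(P - P[i], axis=1)
    dists[i] = np.inf                       # exclude the point itself
    nn = np.argsort(dists)[:k]
    vals, counts = np.unique(labels[nn], return_counts=True)
    return vals[np.argmax(counts)] == labels[i]

def successful_k_values(P, labels, i, k_max, alpha, rng):
    """Randomized search for the recorded set of correctly classifying
    k values of training point i (the set K̄_xi in the text)."""
    k_rand = rng.permutation(np.arange(1, k_max + 1))
    found = [int(k) for k in k_rand[:alpha] if knn_correct(P, labels, i, int(k))]
    if not found:                           # fall back to the first success
        for k in k_rand[alpha:]:
            if knn_correct(P, labels, i, int(k)):
                found = [int(k)]
                break
    return found

def train_ada_knn(P, labels, alpha=10, seed=0):
    """Fit a learner that maps a training point to one of its good k values."""
    rng = np.random.default_rng(seed)
    n = len(P)
    k_max = int(np.ceil(np.sqrt(n)))
    targets = []
    for i in range(n):
        ks = successful_k_values(P, labels, i, k_max, alpha, rng)
        # Pick a random correctly classifying k (default to 1 if none found).
        targets.append(int(rng.choice(ks)) if ks else 1)
    learner = MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000,
                            random_state=seed)
    learner.fit(P, targets)                 # rough approximation of the k-terrain
    return learner, k_max

# Test phase: feed a query point to the learner to obtain k_yi, then classify
# the query with the (weighted) kNN rule sketched earlier using that k.
```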
C. New Learning Method and Ada-kNN2

We use an MLP in the primary implementation of Ada-kNN to model the nonlinear function of correctly classifying k values of the training points. However, a learning algorithm like the MLP involves a number of parameters (the number of hidden layers, the number of hidden nodes in each layer, the learning rate, and a choice of the training algorithm). Moreover, the MLP incurs a high algorithmic complexity. Both of these issues affect the simplicity of Ada-kNN. Furthermore, the MLP uses a single k_xi ∈ K̄_xi, while the set may contain at most α elements, reducing the utilization of the available information. This motivated us to design a simple parameter-free heuristic learning algorithm.

The basic idea of the proposed heuristic scheme is based on the assumption that a test point will have similar properties as its training set neighbors. Let us consider a k value which correctly classifies some of the training set neighbors of a test instance yi ∈ Q. Such a k value can then be expected to correctly predict the class label of yi with a certain probability. The heuristic attempts to find the value of k for which such a probability is maximized over the neighborhood of yi.

To define the problem more formally, let us assume that the neighborhood of yi contains ν points. Furthermore, consider a family of functions F = {f1, f2, . . . , fkmax}. Each of the fk ∈ F is a many-to-one mapping from a data set X to the set {0, 1}, defined as follows:

f_k(x) = 1, if x ∈ X is correctly classified by kNN using k neighbors; 0, otherwise.    (4)

We further define

z_k = Σ_{xi ∈ P_ν(yi)} f_k(xi)    (5)

which is the number of training set neighbors of yi that are correctly classified by kNN using k neighbors. Hence, by construction, z_k is proportional to the probability with which a value of k correctly classifies points in P_ν(yi). Therefore, the problem of finding k_yi can be expressed as follows:

k_yi = arg max_{k ∈ {1,2,...,kmax}} z_k.    (6)

Given the K̄_xi sets for all the xi ∈ P_ν(yi), z_k will be maximized for the choice of k present in the maximum number of such sets. Hence, the optimum k value will be the mode of the multiset formed by concatenating all K̄_xi, ∀xi ∈ P_ν(yi). A random mode can be selected if this multiset is multimodal.
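A minimal Python sketch of this mode-based selection follows. The function name mode_of_k_sets and the random tie-breaking via random.choice are our own illustrative choices, assuming the per-neighbor sets K̄_xi have already been computed (e.g., by the randomized search sketched earlier).

```python
import random
from collections import Counter

def mode_of_k_sets(neighbor_k_sets, rng=random):
    """Pick k_yi as the mode of the multiset obtained by concatenating the
    correctly classifying k values of the query's training set neighbors.

    neighbor_k_sets : list of iterables, one per neighbor in P_nu(y_i),
                      each holding that neighbor's good k values.
    """
    multiset = [k for ks in neighbor_k_sets for k in ks]   # Concat(...)
    counts = Counter(multiset)
    best = max(counts.values())
    # Resolve ties randomly if the multiset is multimodal.
    modes = [k for k, c in counts.items() if c == best]
    return rng.choice(modes)

# Example: two neighbors with k-sets {1, 3} and {3, 4} give the mode k = 3.
print(mode_of_k_sets([{1, 3}, {3, 4}]))
```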
Now, we need to find a suitable value for ν. It is reasonable to assume that a test point lying in a dense region has similar characteristics with a larger number of training set neighbors, whereas a test point residing in a sparse location has similarities with only a small number of neighbors. Therefore, the number of neighbors should be varied according to the local density in the vicinity of the test point, and ν (hereafter called ν_yi for a test point yi ∈ Q) should be a test-point-dependent parameter instead of a global constant. A direct way to find an optimal value of ν_yi is to use the density of the locality surrounding yi. However, instead of calculating or approximating the actual local density, one may use an easily computable indicator of it, such as the nearest neighbor distance. This claim is supported by the following theorem.

Theorem 1: Let a D-dimensional hypersphere contain some uniformly distributed points (including one lying at the center). Let the expected distances of the nearest neighbor from the center be d_N1 and d_N2, respectively, when the hypersphere contains N1 and N2 points. If N1 > N2, then it can be shown that d_N1 < d_N2.

Proof: The proof follows from the derivation of Bhattacharyya and Chakrabarti (see [39, Sec. III]), which we detail in the Supplementary Material.

From Theorem 1, it can be seen that the local density bears an approximately inverse relation to the expected nearest neighbor distance. Therefore, if the nearest neighbor distance is large, we can assume a sparse locality, whereas a low distance will indicate a dense region. Let us denote the distance from a point x to its nearest neighbor in a set S as d_1NN^S(x). Therefore, for a point yi ∈ Q, the nearest training set neighbor is at a distance d_1NN^P(yi). Let us denote dmin = min_{xi ∈ P} d_1NN^{P\{xi}}(xi) and dmax = max_{xi ∈ P} d_1NN^{P\{xi}}(xi).

We now describe three possible cases as follows.
1) d_1NN^P(yi) < dmin: yi has a sufficiently dense locality, and a suitably high number (denoted as νmax) of training set neighbors should influence its k value.
2) d_1NN^P(yi) > dmax: yi has a scantly populated neighborhood, and ν_yi is taken to be 1.
3) dmin ≤ d_1NN^P(yi) ≤ dmax: One should consider a ν_yi ∈ [1, νmax] for finding the value of k_yi.

We choose a data-dependent value of νmax = ⌈∛n⌉, which satisfies the following desirable properties.
1) νmax should be sufficiently small, as the local density is likely to change with an increase in the number of neighbors.
2) νmax should not become too large even for data sets containing a very large number of data instances.

To estimate ν_yi when dmin ≤ d_1NN^P(yi) ≤ dmax, we use expression (7)

ν_yi = ⌈(ν_lin(yi) · ν_exp(yi))^{0.5}⌉,    (7)

where ⌈.⌉ denotes the ceiling function,

ν_lin(yi) = (d_1NN^P(yi) − dmin)(1 − νmax)/(dmax − dmin) + νmax,    (8)

and

ν_exp(yi) = νmax e^{β (d_1NN^P(yi) − dmin)},    (9)

having

β = −log_e(νmax)/(dmax − dmin).    (10)

Here, ν_lin(yi) is the linear estimate, which models ν_yi as a linearly decreasing function of d_1NN^P(yi) between νmax and 1, and is useful for small data sets to maintain a respectable neighborhood size. The exponential estimate ν_exp(yi), on the other hand, is an exponentially decreasing function of d_1NN^P(yi) between the same extremities as the linear one, and is useful for larger data sets to restrict the size of the neighborhood. Furthermore, the reason for using the geometric mean of the two estimates, instead of the arithmetic mean, is to allow the exponential estimate to have more influence on the choice of ν_yi than the linear one (as the geometric mean is closer to the minimum, i.e., the exponential estimate). This is done so that the neighborhood influencing the value of k for a test point can be kept small even for large data sets. We graphically show the situation in Fig. 1.

Fig. 1. Illustration of ν_yi estimation for a test point yi ∈ Q. Green curve: exponential estimate decaying from νmax to 1. Blue line: linear estimate decaying from νmax to 1. The respective estimates ν_exp(yi) and ν_lin(yi) corresponding to d_1NN^P(yi) are also shown. ν_yi is taken to be the geometric mean of these two estimates.
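A small Python sketch of the ν_yi estimate in (7)–(10), including the three distance cases, is given below. The function name and the use of NumPy are our own assumptions; the formulas follow the reconstruction above.

```python
import numpy as np

def estimate_nu(d_query, d_min, d_max, nu_max):
    """Point-specific neighborhood size nu_yi from the query's nearest
    neighbor distance, following eqs. (7)-(10) and the three cases."""
    if d_query < d_min:          # dense locality
        return nu_max
    if d_query > d_max:          # sparse locality
        return 1
    # Linear estimate decaying from nu_max (at d_min) to 1 (at d_max), eq. (8).
    nu_lin = (d_query - d_min) * (1 - nu_max) / (d_max - d_min) + nu_max
    # Exponential estimate between the same extremities, eqs. (9)-(10).
    beta = -np.log(nu_max) / (d_max - d_min)
    nu_exp = nu_max * np.exp(beta * (d_query - d_min))
    # Ceiling of the geometric mean of the two estimates, eq. (7).
    return int(np.ceil(np.sqrt(nu_lin * nu_exp)))

# For example, with nu_max = 3, d_min = 0.1, and d_max = 0.9, a query whose
# nearest training neighbor is 0.85 away gets a small neighborhood.
print(estimate_nu(0.85, 0.1, 0.9, 3))
```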
The Ada-kNN2 classifier differs from Ada-kNN in that the MLP is replaced by the above-described heuristic learning scheme. In Ada-kNN2, given a test point yi, the algorithm starts by finding the nearest neighbor distance, which is then compared with dmin and dmax to determine the value of ν_yi. The algorithm then finds the set P_{ν_yi}(yi) and concatenates the elements of K̄_xi for all xi lying in the ν_yi neighborhood of yi to form a multiset. The value assigned to k_yi will then be the mode of this multiset.

D. Illustrative Example

We now illustrate Ada-kNN and Ada-kNN2 with the help of a two-class classification data set. Let the training data set P contain 14 points from two classes, 1 and 2. Thus, kmax = 4, νmax = 3, and K = {1, 2, 3, 4}. We present in Table I the training points xi along with their class labels c(xi), the sets K_xi of correctly classifying values of k (ties broken arbitrarily), and the randomly selected values k_xi ∈ K_xi.

TABLE I. LIST OF K_xi AND RANDOMLY SELECTED k_xi FROM K_xi, FOR EACH OF THE TRAINING POINTS xi ∈ P IN THE EXAMPLE

The reader can make the following interesting observations from Table I.
1) A single global k ∈ {1, 2, 3, 4} value may not be able to correctly classify all the data points, as ∩_{xi ∈ P} K_xi = φ. This establishes the importance of a data-point-specific choice of k.
2) A data point can usually be classified correctly using the same or a close value of k as its neighbors, i.e., the correctly classifying k values are influenced by the locality information.
In Fig. 2, we plot each training point xi and the corresponding k_xi following Table I. This plot is discrete, and given a new point, the optimal k value cannot be directly estimated from it. However, this information can be used to train an MLP to approximate the nonlinear function of k (referred to as the k-terrain). As the MLP uses a random k_xi ∈ K_xi for a training point xi, the learned k-terrain is not unique. One such possible k-terrain is shown in Fig. 2. To illustrate the efficiency of our proposed classifiers, we take a test set Q = {2, 5}, where c(2) = 1 and c(5) = 2. If we use Ada-kNN, then we take the ceiling of the k-terrain values corresponding to the test points. Following this process, one may obtain k_2 = 1 and k_5 = 3. In the case of Ada-kNN2, we find dmin = 0.1, dmax = 0.9, d_1NN^P(2) = 0.1, and d_1NN^P(5) = 0.2. Thus, ν_2 = 1 and ν_5 = 2, which gives the multisets {1} for test point 2 and {1, 2, 3, 3, 4} for test point 5. Taking the modes gives k_2 = 1 and k_5 = 3, respectively, similar to Ada-kNN. One can verify using P and (2) that k_2 and k_5 can, indeed, correctly classify the corresponding test points.

Fig. 2. Training points and their corresponding values of k from the example, illustrating the estimated k-terrain and predicted k values for the test points.

E. Handling Imbalance in a Data Set

As discussed in Section I-B, the class-specific weighted kNN classifier can prove to be efficient in situations plagued by class imbalance. The class-specific weights can be used to magnify the increments in the number of members from the minority classes in the neighborhood of a test point, while diminishing those of the majority classes to compensate for their abundance.

One way to weigh the classes is to use a global class-weighting scheme, which uses the same set of class weights for all test points. Since there should ideally be an equal number of representatives from each class in P, the ideal probability of a point belonging to class c ∈ C should be r = (1/C). But most data sets are not ideal, and thus, we have nc training points belonging to class c (where the nc values are not all comparable). Hence, Σ_{c=1}^{C} nc = n, and the prior probability of class c is p_c = (nc/n). In order for each class to have a fair chance in the neighborhood of a test point irrespective of class imbalance, we assign the ratio of the ideal and current probabilities for a class c as the global weight wc associated with that class, i.e.,

wc = r/p_c.    (11)
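A minimal Python sketch of the GIHS weights in (11) follows; the function name gihs_weights is our own. The resulting weights can be plugged into a weighted kNN rule such as the one sketched in Section III-A.

```python
from collections import Counter

def gihs_weights(labels):
    """Global Imbalance Handling Scheme: w_c = r / p_c, where r = 1/C is the
    ideal class probability and p_c = n_c / n is the observed prior of class c."""
    counts = Counter(labels)          # n_c for each class c
    n = sum(counts.values())
    C = len(counts)
    r = 1.0 / C
    return {c: r / (n_c / n) for c, n_c in counts.items()}

# Example: a 9:1 two-class imbalance gives the minority class a weight of 5
# and the majority class a weight of about 0.56.
print(gihs_weights([1] * 90 + [2] * 10))
```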
F. Complete Algorithm

The complete algorithms for Ada-kNN and Ada-kNN2, with the addition of GIHS, are schematically shown, respectively, in Algorithms 1 and 2. If we do not wish to consider data imbalance, we can skip employing GIHS by initializing the weight vector W with all 1s, in which case the two algorithms reduce to regular Ada-kNN and Ada-kNN2. Both of the algorithms use a subroutine called TrainRoutine(P, W, K, xi), which is described in Procedure 1, for finding the set K̄_xi. Hereafter, we will denote the i-th member of a set S as S_i. The function Concat(s_i | i = 1, 2, . . . , n) takes values s_1, s_2, . . . , s_n and concatenates them to form a multiset. The subroutine Mode(S) takes a multiset S as input and produces a mode of S as output (resolving ties randomly in case S is multimodal). The function Rand(S) takes a set S as input and outputs a random element of S, while RandPerm(S) returns a random permutation of the elements of set S.

Procedure 1 TrainRoutine(P, W, K, xi)
Initialize: Krand ← RandPerm(K), K̄_xi ← φ.
K̄_xi ← all k among the first α elements of Krand that successfully classify xi using the training set P \ {xi}.
if K̄_xi = φ then
    K̄_xi ← the first k among the rest of the elements of Krand that successfully classifies xi using the training set P \ {xi}.
end if
return K̄_xi

G. Algorithmic Complexity of Ada-kNN and Its Variants

In this section, we analyze the computational complexities of the proposed classifiers. In Theorem 2, we analyze the asymptotic complexity of the regular kNN classifier, which we then utilize to examine the complexities of Ada-kNN, Ada-kNN2, and their GIHS-coupled versions in Theorems 3–5, respectively. Proofs of Theorems 2–5 can be found in the Supplementary Material.

Theorem 2: The asymptotic time complexity for classifying a set of points by the kNN classifier is O(n^3).

Theorem 3: The asymptotic time complexity for classifying a set of points by the Ada-kNN classifier is O(max{n^2.5, n^ψ}), where the supervised k learner takes O(n^ψ) time.

Algorithm 1 Description of the Proposed Ada-kNN Algorithm When Coupled With GIHS
Input: Training set P, test set Q, α.
Output: The set of predicted class labels (C) for Q.
Calculate a set of global weights W = {w1, w2, · · · , wC} for all the data points x ∈ P ∪ Q, using the set P and (11).
Define kmax ← ⌈√n⌉ and K ← {1, 2, · · · , kmax}.
Training Phase:
for all xi ∈ P do
    K̄_xi ← TrainRoutine(P, W, K, xi).
end for
Train the MLP with xi as input and k_xi ← Rand(K̄_xi) as the desired output.
Testing Phase:
for all yi ∈ Q do
    Feed yi to the MLP, which will output k_yi.
    C_i ← classify yi by using P, W, and k_yi.
end for

Algorithm 2 Description of the Proposed Ada-kNN2 Algorithm When Coupled With GIHS
Input: Training set P, test set Q, α.
Output: The set of predicted class labels (C) for Q.
Calculate a set of global weights W = {w1, w2, · · · , wC} for all the data points x ∈ P ∪ Q, using the set P and (11).
Define kmax ← ⌈√n⌉, K ← {1, 2, · · · , kmax}, νmax ← ⌈∛n⌉, and initialize d1NN_i ← 0, ∀i = 1, 2, · · · , n.
Training Phase:
for all xi ∈ P do
    K̄_xi ← TrainRoutine(P, W, K, xi).
    d1NN_i ← d_1NN^{P\{xi}}(xi).
end for
Find dmin and dmax among all the d1NN_i's.
Testing Phase:
for all yi ∈ Q do
    Find d_1NN^P(yi) and ν_yi using (7), (8), and (9).
    k_yi ← Mode(Concat(K̄_xi | ∀xi ∈ P_{ν_yi}(yi))).
    C_i ← classify yi by using P, W, and k_yi.
end for

Theorem 4: The asymptotic time complexity for classifying a set of points by the Ada-kNN2 classifier is O(n^2.5).

Theorem 5: The asymptotic time complexity of Ada-kNN and Ada-kNN2 does not change if GIHS is coupled with them.

From Theorems 2 and 3, we can conclude that the asymptotic computational complexity of Ada-kNN is better than that of kNN. Moreover, from Theorems 3 and 4, we can conclude that Ada-kNN2 can achieve equal or better asymptotic complexity than Ada-kNN, depending on the choice of the supervised learner. Furthermore, Theorem 5 confirms that the addition of GIHS does not increase the asymptotic complexity of Ada-kNN and Ada-kNN2.

IV. EXPERIMENTS AND RESULTS

In this section, we evaluate the performance of our proposed techniques in terms of standard indices using a collection of data sets having diverse nature.

A. Description of the Data Sets

We use standard data sets (a detailed description of which can be found in the Supplementary Material) from the University of California at Irvine machine learning repository [40], the IDA benchmark repository [41], the KEEL repository [42], Akbani et al. [43], the agnostic learning versus prior knowledge challenge database [44], and the LibSVM repository [45], categorized according to the two following criteria.
1) Scale of Data Set: If a data set contains more than 4000 data points and/or has a data dimension greater than 45, then we call it large-scale in this paper. All the other data sets are considered small/medium-scale.
2) Degree of Imbalance: This can be quantified in the form of the imbalance ratio (IR). IR is defined for a two-class data set as the ratio of the number of points in the majority class to that of the minority class. In the case of multi-class data sets, IR is taken to be the maximum of the IR values calculated between all of the pairs of classes. Based on its IR value, a data set can be either balanced (IR ≤ 1.15), mildly imbalanced (1.15 < IR ≤ 3.5), or highly imbalanced (IR > 3.5); a short code sketch of this computation follows Fig. 3.
A data set can be placed into any one of the six possible categories, based on the two above-mentioned criteria, as described in Fig. 3.

Fig. 3. Category of data sets.
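This is a minimal Python sketch of the IR computation described in criterion 2); the function name imbalance_ratio and the use of collections.Counter are our own illustrative choices.

```python
from collections import Counter
from itertools import combinations

def imbalance_ratio(labels):
    """Imbalance ratio (IR): majority/minority count for two classes, and the
    maximum pairwise ratio over all class pairs for multi-class data sets."""
    counts = Counter(labels)
    if len(counts) == 2:
        (_, n1), (_, n2) = counts.most_common(2)   # n1 >= n2
        return n1 / n2
    return max(max(a, b) / min(a, b)
               for a, b in combinations(counts.values(), 2))

# Example: a three-class data set with 100, 40, and 10 points has IR = 10,
# so it would be categorized as highly imbalanced (IR > 3.5).
print(imbalance_ratio([1] * 100 + [2] * 40 + [3] * 10))
```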
B. Indices for Evaluation of Classification Performance

We use four different indices to evaluate the performance of the contending classifiers.

1) Accuracy: In a C-class classification problem, let the test set contain m points. Suppose that the classifier correctly classifies m′ of the test set members. Then, the accuracy is given by acc = (m′/m). Accuracy is used to evaluate the performance in those of our experiments which are not concerned with class imbalance.

2) Gmeans: Accuracy may still be rather high even if the classifier wrongly classifies most of the minority class members while achieving high accuracy over the majority classes. Gmeans [46] is a measure which separately considers the classifier's performance on each of the classes. Here, we extend it to the multi-class scenario in the following way. Let us assume a C-class classification problem, where the test set contains m points and m_c points belong to class c (c = 1, 2, . . . , C). If we assume that the classifier correctly classifies m′_c members of class c, then Gmeans is calculated as GM = (∏_{c=1}^{C} (m′_c/m_c))^{(1/C)}.

3) Area Under Receiver Operating Characteristics Curve: This index is defined as AUROC = ((1 + tpr − fpr)/2) [47], where tpr and fpr, respectively, stand for the true positive rate and false positive rate [47], considering the minority class as positively labeled. In the case of a multi-class classification problem, an extension of the area under receiver operating characteristics curve (AUROC) can be used [48].

4) Area Under Recall Precision Curve: Considering the minority class members to be positively labeled, we calculate the true positive rate (tpr) and precision (prec) of the classifier [49]. The value of the area under recall precision curve (AURPC) is then obtained as AURPC = ((tpr + prec)/2) [49]. Similar to AUROC, one can extend the definition of AURPC for the multi-class classification problem.
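The following Python sketch computes accuracy, the multi-class Gmeans extension, and the single-point AUROC defined above from true and predicted label vectors; the function names and the toy example are our own.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """acc = m'/m : fraction of correctly classified test points."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def gmeans(y_true, y_pred):
    """Multi-class Gmeans: geometric mean of the per-class recalls,
    GM = (prod_c m'_c / m_c) ** (1 / C)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(classes)))

def auroc_single_point(y_true, y_pred, positive):
    """Single-point AUROC = (1 + tpr - fpr) / 2, with the minority class
    treated as the positive label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = (y_true == positive), (y_true != positive)
    tpr = np.mean(y_pred[pos] == positive)
    fpr = np.mean(y_pred[neg] == positive)
    return (1.0 + tpr - fpr) / 2.0

# Example on a toy prediction vector:
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]
print(accuracy(y_true, y_pred), gmeans(y_true, y_pred),
      auroc_single_point(y_true, y_pred, positive=2))
```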
C. Experimental Procedure

For all the experiments except those involving k∗NN, we perform a 10-fold cross validation (5-fold for a fair comparison with k∗NN) to eliminate the training-test set dependence from our reported results. To further eliminate any bias generated from the randomization involved in the algorithms, we run them 10 times on each of the 10 training-test set pairs generated by cross validation [50]. The MLP used by Ada-kNN has D nodes in the input layer, kmax nodes in the output layer, and a single hidden layer containing 15 hidden nodes. These settings for the MLP are kept constant for all experiments, as tuning them did not vary the results significantly. We have used the default MATLAB implementation of the MLP in our experiments.

We perform two sets of experiments. In the first set, we compare the performance of the regular kNN classifier, the AkNN classifier, the ENN classifier (all with global k values of 1, 3, 5, 7, 9, and √n, where n is the number of training points), MLP-BP, MLP-SCG (both using the default MATLAB settings), DT (MATLAB implementation with default settings), LDA, and dyn-kNN (implemented with settings as advised in the literature) with those of the Ada-kNN and Ada-kNN2 classifiers, in terms of accuracy. We also compare the performance of our proposed classifiers with those of k∗NN and FABC reported, respectively, in [19] and [18]. These sets of experiments are performed to establish the proposed classifiers as significant improvements over popular classifiers and recent variants of kNN (in the absence of severe class imbalance). For this reason, data sets belonging to categories 1–4 are used in these experiments.

In the second set, we compare the performance of the regular kNN classifier and the regular kNN classifier equipped with GIHS, NWkNN, kENN, kPNN (all with global k values of 1, 3, 5, 7, 9, and √n), CCNND (with the nearest neighbor), dyn-kNN (all parameters and fitness evaluation set according to the literature), WkNN (with √n neighbors), kRNN (all parameters set following the corresponding study), Ada-kNN + SMOTE and Ada-kNN2 + SMOTE (the number of neighbors is kept fixed at 5 as advised by the original article [27]; for two-class classification problems, the minority class is over-sampled by 100%, 200%, 300%, 400%, and 500%, and for multiple classes, the choice of this parameter is described in the Supplementary Material), Ada-kNN + RUS and Ada-kNN2 + RUS (for two-class data sets, the majority class is under-sampled by 10%, 30%, 50%, and 70%, and for multi-class problems, the parameter is chosen according to the technique described in the Supplementary Material), and Ada-kNN + GIHS and Ada-kNN2 + GIHS, in terms of GM, AUROC, and AURPC [51]. Since the kRNN, kPNN, and kENN classifiers cannot be used on data sets containing multiple classes, they are omitted from the multi-class experiments. We aim to evaluate the extent to which GIHS, SMOTE, or RUS can improve the immunity of the proposed classifiers against imbalance, when compared with the state-of-the-art. For this reason, we choose data sets belonging to categories 3–6 for these experiments.

We report the mean index values as well as the Average Rank (AR) over all experimented data sets for each of the algorithms (ARSMS for Small- and Medium-Scale data sets and ARLS for Large-Scale data sets are calculated separately).

We perform a nonparametric statistical test known as the Wilcoxon rank sum test [52] to check whether the performance of Ada-kNN2 (Ada-kNN2 + GIHS for the experiments involving imbalanced classifiers, as we expect that the adaptiveness of Ada-kNN2 + GIHS dispenses with the need for imbalance handling techniques with tunable parameters, such as SMOTE or RUS) on each data set differs significantly from that of the competitors with at least a 5% level of significance. The result of this test is summarized (and detailed in the Supplementary Material) in terms of Wins (W), Ties (T), and Losses (L). No statistical test was undertaken while comparing the performance of Ada-kNN and Ada-kNN2 with k∗NN and FABC, as the results for the latter contenders are, respectively, quoted from [19] and [18].

D. Choice of Parameter α for Ada-kNN and Ada-kNN2

A suitably large choice of α will produce a good coverage of K_xi for a point xi ∈ P, but will also increase the computational complexity. A small value of α, on the other hand, will reduce the computational complexity, but may increase the chance of overfitting by decreasing the noise immunity. Furthermore, we observed during the experiments that tuning the value of α does not significantly improve the results, suggesting very little data dependence. These factors lead us to take a constant value of α = 10 for all our experiments.

TABLE II. COMPARISON OF DIFFERENT CLASSIFIERS IN TERMS OF ACCURACY

TABLE III. COMPARISON OF FABC AND k∗NN WITH ADA-kNN AND ADA-kNN2

E. Performance of Ada-kNN and Ada-kNN2 Compared With Other State-of-the-Art Classifiers in Terms of acc

We summarize the results of Ada-kNN, Ada-kNN2, and the eight other competing algorithms over 17 (11 small/medium-scale and 6 large-scale) data sets in terms of acc in Table II. An inspection of the results on the small/medium-scale data sets reveals that Ada-kNN2 achieved the best mean acc and ARSMS. The Ada-kNN algorithm managed to attain the fifth place both in terms of average acc and ARSMS. However, its performance is statistically comparable to that of Ada-kNN2 on 10 data sets.

In the case of the large-scale data sets, Ada-kNN2 achieved the second best average acc, slightly lower than that of MLP-SCG. However, Ada-kNN2 achieved the best ARLS, which confirms its better consistency and scalability. Ada-kNN, on the other hand, exhibits lower competence on the large-scale data sets, as is evident from its high ARLS and low average acc. Ada-kNN failed to achieve good performance on the large-scale data sets, possibly due to insufficient learning of the k-terrain by the underlying MLP, leading to erroneous prediction of the k_yi values.

The comparison between Ada-kNN, Ada-kNN2, and FABC (over eight data sets) is summarized in the first three rows of Table III, and it shows that Ada-kNN2 achieved the best AR. Ada-kNN and FABC both achieved the same AR, with Ada-kNN attaining the higher average acc.

The comparison between the two proposed classifiers and k∗NN (on six data sets) is summarized in the last three rows of Table III (the original study introducing k∗NN [19] used misclassification error as the index, which we convert to acc for the sake of consistency). Ada-kNN achieved both the best average acc and the best rank in this case, followed by Ada-kNN2.

These findings imply that the proposed classifiers outperform the state of the art on small/medium-scale data sets due to the adaptive choice of k values. Ada-kNN2 is also able to retain the good performance on the large-scale data sets, unlike Ada-kNN, which seems to suffer from scalability issues.

TABLE IV. COMPARISON OF CLASSIFIERS IN TERMS OF GM ON TWO-CLASS IMBALANCED DATA SETS

TABLE V. COMPARISON OF CLASSIFIERS IN TERMS OF AUROC ON TWO-CLASS IMBALANCED DATA SETS

TABLE VI. COMPARISON OF CLASSIFIERS IN TERMS OF AURPC ON TWO-CLASS IMBALANCED DATA SETS

F. Performance of Ada-kNN + GIHS and Ada-kNN2 + GIHS on Two-Class Imbalanced Data Sets in Terms of GM, AUROC, and AURPC

In Tables IV–VI, we summarize the results in terms of GM, AUROC, and AURPC for 15 small/medium and 7 large-scale two-class imbalanced data sets. An inspection of the results for the small/medium-scale data sets reveals that Ada-kNN2 + GIHS, Ada-kNN + GIHS, and kPNN outperformed the other contenders. Ada-kNN2 + GIHS achieved the best average index value as well as ARSMS for both GM and AUROC. kPNN achieved a slightly better index value and ARSMS compared with Ada-kNN + GIHS for AURPC. However, the W-T-L counts for AURPC point toward the greater consistency of Ada-kNN2 + GIHS. Therefore, the Ada-kNN2 + GIHS classifier seems to be the best choice for this type of data sets.

On the seven high-dimensional data sets, Ada-kNN2 + GIHS, kNN + GIHS, and kPNN achieved relatively better performance than the other competitors. Though kPNN achieved better average index values, the lower ARLS attained by Ada-kNN2 + GIHS on GM as well as AUROC implies the greater consistency of the latter.

TABLE IX
COMPARISON OF CLASSIFIERS IN TERMS OF AURPC ON MULTI-CLASS IMBALANCED DATA SETS

which report statistically comparable AURPC performance for V. C ONCLUSION AND F UTURE WORK
Ada-kNN2 GIHS against kNN GIHS as well as kPNN,
+ We now summarize the various key findings of this paper
further attest to this fact. The slightly poorer performance of
Ada-kNN2 GIHS may be due to the increasing bias of the and present our concluding remarks. The proposed adaptive
+ choice of k values for the kNN classifier can prove to be
index toward the accuracy on the majority class with rising
IR, as shown by an example in the Supplementary Material. very useful for data classification over the conventional global
Hence, we conclude that Ada-kNN2 GIHS remains com- choices (such as k 1). = This is because such adaptive
+ of data sets as well.
petitive to the state of the art on this type choice of k can account for the properties of the neighborhood
Thus, Ada-kNN2 GIHS can outperform the state-of-the- of the test point, which may remain ignored by a single
+
art imbalanced classifiers over two-class imbalanced data sets global choice of k. The advantage of such an approach is
of varying scale owing to the adaptive choice of k augmented evident from the consistently commendable performance of
with GIHS. Ada-kNN and Ada-kNN2. GIHS is a simple weighting scheme
incurring constant time complexity, which does not involve
any additional parameters. Furthermore, in contrast to the pre-
G. Performance of Ada-kNN + GIHS and Ada-kNN2 + GIHS on Multi-Class Imbalanced Data Sets in Terms of GM, AUROC, and AURPC

We report the summary of results in terms of GM, AUROC, and AURPC for the 12 contending classifiers on six small/medium and five large-scale multi-class imbalanced data sets in Tables VII–IX. On the small/medium-scale data sets, Ada-kNN2 + GIHS and kNN + GIHS emerge as the top performers. Ada-kNN2 + GIHS achieved better index values as well as ARSMS for all three performance-evaluating indices. Among the contenders, kNN + GIHS demonstrated the second-best performance in terms of both the ARSMS and the average index value for all the indices. Therefore, Ada-kNN2 + GIHS is found to be a good choice for this type of data sets.

Ada-kNN2 + GIHS and kNN + GIHS also performed better than the other contenders in terms of GM and AUROC on the large-scale data sets, which indicates that both of these methods were able to achieve equivalent classwise accuracies for all the classes present in a data set. However, the performance of Ada-kNN2 + GIHS in terms of AURPC lags behind that of other methods, such as kNN, NWkNN, and dyn-kNN, due to the bias of the index explained for the two-class imbalanced case. Since this lag is attributable to the bias of the index rather than to the classifier itself, Ada-kNN2 + GIHS should still be the method of choice for this type of data sets. Furthermore, the kNN + GIHS classifier is able to retain good performance in terms of AURPC as well, which points toward the efficacy of the simple, parameter-free GIHS proposal.

Hence, one should employ Ada-kNN2 + GIHS for multi-class imbalanced data sets of varying scale, which further indicates the usefulness of the adaptive choice of k and the proposed weighting scheme.
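To make the role of such a weighting scheme concrete, the sketch below illustrates the general idea of a class-based global weighting of kNN votes. It is only a schematic stand-in: the inverse-class-frequency weights and the helper name weighted_knn_predict are assumptions introduced here for illustration, not the exact GIHS weights defined earlier in the paper.

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    """kNN prediction with class-based global vote weighting (illustrative only).

    Each neighbor's vote is scaled by the inverse of its class's training
    frequency, so minority-class neighbors are not drowned out by
    majority-class ones. These weights are an assumed stand-in for GIHS.
    """
    n = len(y_train)
    class_counts = Counter(y_train)
    # Global, class-based weights: computed once from the training labels.
    class_weight = {c: n / count for c, count in class_counts.items()}

    # Plain Euclidean k-nearest-neighbor search.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    neighbors = np.argsort(dists)[:k]

    # Accumulate weighted votes and return the class with the largest total.
    votes = {}
    for i in neighbors:
        c = y_train[i]
        votes[c] = votes.get(c, 0.0) + class_weight[c]
    return max(votes, key=votes.get)
```

Because the class weights are computed once from the training labels, the per-query overhead is only a constant number of extra multiplications, which is consistent with the constant-time, parameter-free character of this style of weighting.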
V. CONCLUSION AND FUTURE WORK

We now summarize the key findings of this paper and present our concluding remarks. The proposed adaptive choice of k values for the kNN classifier can prove to be very useful for data classification, as compared with the conventional global choices (such as k = 1). This is because such an adaptive choice of k can account for the properties of the neighborhood of the test point, which may remain ignored by a single global choice of k. The advantage of such an approach is evident from the consistently commendable performance of Ada-kNN and Ada-kNN2. GIHS is a simple weighting scheme incurring constant time complexity, which does not involve any additional parameters. Furthermore, in contrast to the preprocessing techniques used for imbalance handling, GIHS has the additional advantage of preserving the class distributions of the data set. GIHS is observed to further augment the performance of Ada-kNN and Ada-kNN2 in the presence of class imbalance.

The MLP-based proposal (Ada-kNN) for adaptively choosing k values seems to suffer from scalability issues on large-scale data sets. On the other hand, the proposed heuristic counterpart (Ada-kNN2) is immune to such scalability issues, as evident from its admirable performance on small as well as large-scale data sets. Moreover, the improved performance of Ada-kNN2 + GIHS is observed to be resilient to the increase in IR (see the Supplementary Material). An interesting future extension of this paper may be to replace the MLP in Ada-kNN with a more scalable supervised learning algorithm. Another future direction would be to investigate how different distance measures affect the performance of Ada-kNN and Ada-kNN2. Specifically, it may be interesting to study the effect of the choice of distance measure on high-dimensional data sets, where the Euclidean distance is known to perform poorly [53].
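On the last point, experimenting with alternative distance measures requires only changing the metric used by the neighbor search. The following sketch assumes scikit-learn's KNeighborsClassifier (not the authors' implementation) and shows how Manhattan or cosine distance could be substituted for the default Euclidean metric when working with high-dimensional data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Euclidean distance (the default: metric='minkowski' with p=2).
knn_euclidean = KNeighborsClassifier(n_neighbors=5)

# Manhattan distance, often better behaved than Euclidean in high dimensions.
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')

# Cosine distance; requires the brute-force search strategy.
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric='cosine', algorithm='brute')

# Each variant is used in the same way, e.g.:
# knn_manhattan.fit(X_train, y_train); y_pred = knn_manhattan.predict(X_test)
```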
REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Hoboken, NJ, USA: Wiley, 2000.
[2] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[3] P. Hall, B. U. Park, and R. J. Samworth, “Choice of neighbor order in nearest-neighbor classification,” Ann. Statist., vol. 36, no. 5, pp. 2135–2152, 2008.
[4] D. O. Loftsgaarden and C. P. Quesenberry, “A nonparametric estimate of a multivariate density function,” Ann. Math. Statist., vol. 36, no. 3, pp. 1049–1051, 1965.
[5] C.-L. Liu and M. Nakagawa, “Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition,” Pattern Recognit., vol. 34, no. 3, pp. 601–615, 2001.
[6] C. Domeniconi, J. Peng, and D. Gunopulos, “Locally adaptive metric nearest-neighbor classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1281–1285, Sep. 2002.
[7] G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, “An affinity-based new local distance function and similarity measure for kNN algorithm,” Pattern Recognit. Lett., vol. 33, no. 3, pp. 356–363, 2012.
[8] C. C. Holmes and N. M. Adams, “A probabilistic nearest neighbour method for statistical pattern recognition,” J. Roy. Statist. Soc. Ser. B, Statist. Methodol., vol. 64, no. 2, pp. 295–306, 2002.
[9] A. K. Ghosh, “On optimum choice of k in nearest neighbor classification,” Comput. Statist. Data Anal., vol. 50, no. 11, pp. 3113–3123, 2006.
[10] Y. Sun, A. K. Wong, and M. S. Kamel, “Classification of imbalanced data: A review,” Int. J. Pattern Recognit. Artif. Intell., vol. 23, no. 4, pp. 687–719, 2009.
[11] M. F. Møller, “A scaled conjugate gradient algorithm for fast supervised learning,” Neural Netw., vol. 6, no. 4, pp. 525–533, Nov. 1993.
[12] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
[15] J. Wang, P. Neskovic, and L. N. Cooper, “Improving nearest neighbor rule with a simple adaptive distance measure,” Pattern Recognit. Lett., vol. 28, no. 2, pp. 207–213, 2007.
[16] B. Tang and H. He, “ENN: Extended nearest neighbor method for pattern recognition,” IEEE Comput. Intell. Mag., vol. 10, no. 3, pp. 52–60, Aug. 2015.
[17] N. García-Pedrajas, J. A. D. Castillo, and G. Cerruela-Garcia, “A proposal for local k values for k-nearest neighbor rule,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 2, pp. 470–475, Feb. 2015.
[18] S. Ezghari, A. Zahi, and K. Zenkouar, “A new nearest neighbor classification method based on fuzzy set theory and aggregation operators,” Expert Syst. Appl., vol. 80, pp. 58–74, Sep. 2017.
[19] O. Anava and K. Levy, “k*-nearest neighbors: From global to local,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4916–4924.
[20] X. Zhang, Y. Li, R. Kotagiri, L. Wu, Z. Tari, and M. Cheriet, “KRNN: K rare-class nearest neighbour classification,” Pattern Recognit., vol. 62, pp. 33–44, Feb. 2017.
[21] Y. Li and X. Zhang, “Improving k nearest neighbor with exemplar generalization for imbalanced classification,” in Proc. 15th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2011, pp. 321–332.
[22] X. Zhang and Y. Li, “A positive-biased nearest neighbour algorithm for imbalanced classification,” in Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2013, pp. 293–304.
[23] E. Kriminger, J. C. Príncipe, and C. Lakshminarayan, “Nearest neighbor distributions for imbalanced classification,” in Proc. Int. Joint Conf. Neural Netw., 2012, pp. 1–5.
[24] S. Tan, “Neighbor-weighted k-nearest neighbor for unbalanced text corpus,” Expert Syst. Appl., vol. 28, no. 4, pp. 667–671, 2005.
[25] H. Dubey and V. Pudi, “Class based weighted k-nearest neighbor over imbalance dataset,” in Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2013, pp. 305–316.
[26] N. Japkowicz, “The class imbalance problem: Significance and strategies,” in Proc. Int. Conf. Artif. Intell. (ICAI), 2000, pp. 111–117.
[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[28] D. Wettschereck and T. G. Dietterich, “Locally adaptive nearest neighbor algorithms,” in Proc. Adv. Neural Inf. Process. Syst., 1994, pp. 184–191.
[29] S. Ougiaroglou, A. Nanopoulos, A. N. Papadopoulos, Y. Manolopoulos, and T. Welzer-Druzovec, “Adaptive k-nearest-neighbor classification using a dynamic number of nearest neighbors,” in Advances in Databases and Information Systems. Berlin, Germany: Springer, 2007, pp. 66–82.
[30] G. Hjaltason and H. Samet, “Distance browsing in spatial databases,” ACM Trans. Database Syst., vol. 24, no. 2, pp. 265–318, 1999.
[31] G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, “A probabilistic framework for dynamic k estimation in kNN classifiers with certainty factor,” in Proc. 8th Int. Conf. Adv. Pattern Recognit., Jan. 2015, pp. 1–5.
[32] P. Branco, L. Torgo, and R. P. Ribeiro, “A survey of predictive modeling on imbalanced domains,” ACM Comput. Surv., vol. 49, no. 2, pp. 31:1–31:50, 2016.
[33] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics,” Inf. Sci., vol. 250, pp. 113–141, Nov. 2013.
[34] B. Krawczyk, “Learning from imbalanced data: Open challenges and future directions,” Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, 2016.
[35] J. Zhang and I. Mani, “kNN approach to unbalanced data distributions: A case study involving information extraction,” in Proc. ICML Workshop Learn. Imbalanced Datasets, 2003, pp. 1–7.
[36] L. Wang, L. Khan, and B. Thuraisingham, “An effective evidence theory based k-nearest neighbor (kNN) classification,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intell. Intell. Agent Technol., Dec. 2008, pp. 797–801.
[37] H. Wang and D. Bell, “Extended k-nearest neighbours based on evidence theory,” Comput. J., vol. 47, no. 6, pp. 662–672, 2003.
[38] W. Liu and S. Chawla, “Class confidence weighted kNN algorithms for imbalanced data sets,” in Proc. 15th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2011, pp. 345–356.
[39] P. Bhattacharyya and B. K. Chakrabarti, “The mean distance to the nth neighbour in a uniform distribution of random points: An application of probability theory,” Eur. J. Phys., vol. 29, no. 3, p. 639, 2008.
[40] M. Lichman. (2013). UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml
[41] G. Rätsch. (2001). IDA Benchmark Repository. [Online]. Available: http://ida.first.fhg.de/projects/bench/benchmarks.htm
[42] I. Triguero et al., “KEEL 3.0: An open source software for multi-stage analysis in data mining,” Int. J. Comput. Intell. Syst., vol. 10, no. 1, pp. 1238–1249, 2017.
[43] R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proc. Eur. Conf. Mach. Learn., 2004, pp. 39–50.
[44] I. Guyon. (2006). Datasets for the Agnostic Learning vs. Prior Knowledge Competition. [Online]. Available: http://www.agnostic.inf.ethz.ch/datasets.php
[45] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011.
[46] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” in Proc. 14th Int. Conf. Mach. Learn., 1997, pp. 179–186.
[47] M. A. Maloof, “Learning when data sets are imbalanced and when costs are unequal and unknown,” in Proc. Int. Conf. Mach. Learn., 2003, pp. 1–5.
[48] D. J. Hand and R. J. Till, “A simple generalisation of the area under the ROC curve for multiple class classification problems,” Mach. Learn., vol. 45, no. 2, pp. 171–186, 2001.
[49] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 233–240.
[50] T. Raeder, T. R. Hoens, and N. V. Chawla, “Consequences of variability in classifier performance estimates,” in Proc. IEEE 10th Int. Conf. Data Mining (ICDM), Dec. 2010, pp. 421–430.
[51] T. Raeder, G. Forman, and N. V. Chawla, “Learning from imbalanced data: Evaluation matters,” in Data Mining: Foundations and Intelligent Paradigms. Berlin, Germany: Springer, 2012, pp. 315–331.
[52] M. Hollander and D. A. Wolfe, Nonparametric Statistical Methods. Hoboken, NJ, USA: Wiley, 1999.
[53] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in Proc. 8th Int. Conf. Database Theory, London, U.K., 2001, pp. 420–434.