Abstract— The classification accuracy of a k-nearest neighbor (kNN) classifier is largely dependent on the choice of the number of nearest neighbors, denoted by k. However, given a data set, it is a tedious task to optimize the performance of kNN by tuning k. Moreover, the performance of kNN degrades in the presence of class imbalance, a situation characterized by disparate representation from different classes. We aim to address both of these issues in this paper and propose a variant of kNN called the Adaptive kNN (Ada-kNN). The Ada-kNN classifier uses the density and distribution of the neighborhood of a test point and learns a suitable point-specific k for it with the help of artificial neural networks. We further improve our proposal by replacing the neural network with a heuristic learning method guided by an indicator of the local density of a test point and using information about its neighboring training points. The proposed heuristic learning algorithm preserves the simplicity of kNN without incurring serious computational burden. We call this method Ada-kNN2. Ada-kNN and Ada-kNN2 perform very competitively when compared with kNN, five of kNN's state-of-the-art variants, and other popular classifiers. Furthermore, we propose a class-based global weighting scheme (Global Imbalance Handling Scheme or GIHS) to compensate for the effect of class imbalance. We perform extensive experiments on a wide variety of data sets to establish the improvement shown by Ada-kNN and Ada-kNN2 using the proposed GIHS, when compared with kNN and its 12 variants specifically tailored for imbalanced classification.

Index Terms— Heuristic learning, imbalanced classification, k-nearest neighbor (kNN), parameter adaptation, supervised learning.

characteristics of g(.). In this stage, for all data points xi ∈ P, where i = 1, 2, . . . , n, the value of g(xi) is available to the classifier. Once trained, the classifier is expected to correctly predict the value of g(yi) for a new data point yi ∈ Q (Q ⊂ X, |Q| = m, and i = 1, 2, . . . , m). This is called the testing phase. The data point yi is known as a test or query point, while the set Q of all such points is called a test set.

The k-Nearest Neighbor (kNN) classifier has always been preferred for its methodical simplicity, nonparametric working principle [1], and ease of implementation. The kNN classifier involves the tuning of a single parameter k (the number of nearest neighbors to be considered). However, it is not easy to find the value of k for which the algorithm performs optimally on a wide range of data sets (or for all the points in the same data set). Theoretical studies suggest that the number of points in the training set (say n) and the value of k (1 ≤ k ≤ n) both control the performance of the kNN algorithm [2]. Furthermore, if k = 1, then the probability of misclassification is bounded above by twice the risk of the Bayes decision rule as n → ∞ [2]. However, depending on the data set, choices other than k = 1 may be more suitable [3]. Therefore, the theory discussed in [2] and [4] does not help with the choice of k in practical cases. Usually, a global k is chosen, i.e., a single value of k for classifying all test points. Conventional choices of such a global k value are 1, 3, 5, 7, and 9 [5], [6], but it may also be as large as k = √n [1], [7].
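The burden of tuning a single global k can be made concrete with a short sketch (ours, not from the paper): a leave-one-out accuracy sweep over the conventional candidate values on a toy two-class set, using plain NumPy.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classic kNN rule: majority class among the k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of kNN for a single global k."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += knn_predict(X[mask], y[mask], X[i], k) == y[i]
    return hits / len(X)

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs as a toy two-class data set.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# Sweep the conventional global choices of k, plus k = sqrt(n).
candidates = [1, 3, 5, 7, 9, int(np.sqrt(len(X)))]
scores = {k: loo_accuracy(X, y, k) for k in candidates}
best_k = max(scores, key=scores.get)
```

Even on such an easy data set, the sweep costs n extra classifier evaluations per candidate k, and the winning k need not transfer to other data sets — which is the tedium the point-specific approach aims to remove.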
B. Motivation

The conventional and widely accepted practice of using a global k value completely ignores the local data density and class distribution (also referred to as local information in the subsequent text) of a test point's neighborhood. The majority of practical data sets have a global density and distribution of classes differing widely from those in the neighborhood of a test point. In such data sets, the information of the locality around a test point often proves to be vital for achieving the right classification. In the case of the kNN classifier, the locality information around a test point should be considered while choosing a suitable k value for it. This would result in a data-point-specific choice of k. Since data points belonging to the same locality are likely to be correctly classified by similar values of k, the optimal k value can be modeled as a suitable nonlinear function of the training points (see Section III-D) called the k-terrain. Hence, the proposed Ada-kNN classifier, which learns the k-terrain using a supervised learner, is likely to perform better than the traditional kNN and its variants that use a global k value.

It is worth noting that Ada-kNN is reliant on the efficacy of an additional learning algorithm, which further adds to the computational complexity. As knowledge of the entire nonlinear function of k is not an absolute necessity, an approximation of the function in the vicinity of a test point is likely to suffice. Thus, the number of neighbors which should influence the k value of a query point can also be estimated based on an indication of the local data density. One may then use the correctly classifying k information of these neighbors (reflective of the local information) to estimate the optimal k value for the query point in question. This idea motivated us to replace the supervised learner in Ada-kNN with such a heuristic learning scheme, giving rise to the Ada-kNN2 classifier, which is also presented in this paper.

Achieving good classification on imbalanced data sets is a challenge for the kNN classifier, as it relies on the expectation that in the neighborhood of a test point, the number of training points from its ideal class will be high. In reality, this may not hold true for a test point belonging to the minority class of an imbalanced data set, as most of its neighbors may still be from one of the majority classes in spite of an increase in the number of minority neighbors (due to the global scarcity of the minority points). In such a scenario, the basic kNN classifier, which does not consider the class imbalance, will wrongly assign the test point to a majority class. The class-specific weighted kNN classifier can prove to be efficient in such situations. However, while choosing the weights, the local information should also be considered. In Ada-kNN and Ada-kNN2, since the local information is accounted for by the point-specific value of k, a global class-specific weighting scheme may suffice. Therefore, we present a scheme to combine class-specific global weights with the proposed classifiers to handle class imbalance.

C. Contribution

The main contributions of this paper are as follows.

To facilitate the choice of a data-point-specific k, the proposed Ada-kNN method¹ first finds, for each of the training points, a value of k for which kNN can accurately classify that point. Assuming this k value to approximate the local information of the neighborhood of a data point, our algorithm tries to estimate a suitable k value for each query point. This gives rise to a nonlinear regression problem, which can be solved by a feedforward Multi-Layer Perceptron (MLP) using the Scaled Conjugate Gradient (SCG) learning algorithm [11] to estimate the k-terrain.

We further propose a new learning approach that directly utilizes the correctly classifying k values of the neighbors to estimate a suitable k value for each query point. We use the nearest neighbor distance as an indicator of the local density for a query point and to proportionally vary the size of its neighborhood. The resultant algorithm is called Ada-kNN2¹. Both the proposed methods are compared with kNN, Linear Discriminant Analysis (LDA) [12], MLP using Back-Propagation [13] (MLP-BP), MLP using SCG [11] (MLP-SCG), Decision Trees (DTs) [14], AkNN [15], Extended Nearest Neighbor (ENN) [16], dyn-kNN [17], Fuzzy Analogy-Based Classification (FABC) [18], and k∗ Nearest Neighbor (k∗NN) [19] over 17 data sets (over six data sets for k∗NN and eight data sets for FABC) of varying properties. Both the proposed methods are observed to perform competitively, with Ada-kNN performing equivalent to the best on most small- and medium-scale data sets, while Ada-kNN2 generally achieves the best result among all the contenders.

The proposed classifiers can tackle the local variations of a data set with the varying value of k, thus dispensing with the need for local weighting schemes on imbalanced data sets. Therefore, we define a simple Global Imbalance Handling Scheme (GIHS) and combine it with the kNN, Ada-kNN, and Ada-kNN2 classifiers¹ to observe the performance improvement on imbalanced data sets. We perform extensive experiments using 22 two-class data sets and 11 multi-class data sets having varying degrees of imbalance and compare the performance of the proposed algorithms against that of the state-of-the-art variants of kNN for data imbalance [namely, kNN, k Rare class Nearest Neighbor (kRNN) [20], k Exampler-based Nearest Neighbor (kENN) [21], k Positive-biased Nearest Neighbor (kPNN) [22], Class Conditional Nearest Neighbor Distribution (CCNND) [23], dyn-kNN [17], Neighbor-Weighted kNN (NWkNN) [24], and WkNN [25]], as well as Ada-kNN and Ada-kNN2 coupled with under-sampling [26] and over-sampling [27] techniques. The proposed methods achieve competitive performance compared with the competing algorithms, with Ada-kNN2 coupled with GIHS being the best performer in most cases.

D. Organization

We arrange the rest of this paper, subsequent to the introduction in Section I, in the following manner. In Section II, we present a brief but comprehensive review of the notable works done in the field. In Section III, we describe Ada-kNN and Ada-kNN2, followed by an explanation of GIHS for handling imbalance. In Section IV, we first discuss the choice

¹Codes are available from: https://github.com/SankhaSubhra/Ada-kNN.
MULLICK et al.: ADAPTIVE LEARNING-BASED K NN CLASSIFIERS WITH RESILIENCE TO CLASS IMBALANCE 3
of the data sets and introduce the indices for quantifying The algorithm then assigns to a training point xi , the value
the performance of the classification methods. The rest of of k which performs the best (in terms of global and local
Section IV contains experimental results and a comparative performance) over those training points of which xi is a
discussion of the proposed methods along with their contenders. We present our conclusion and the future directions
of research in Section V. philosophies for assigning k values to the training and test
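The Ada-kNN training stage outlined in the Contribution can be sketched as follows. This is a simplified illustration, not the authors' implementation: a plain average over the 3 nearest training points stands in for the MLP-SCG learner, and all function names are ours.

```python
import numpy as np

def correctly_classifying_k(X, y, i, k_max):
    """Smallest k in {1,...,k_max} for which kNN classifies training point i
    correctly (a simplified stand-in for the search over successful k values)."""
    d = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(d)[1:]          # exclude the point itself
    for k in range(1, k_max + 1):
        labels, counts = np.unique(y[order[:k]], return_counts=True)
        if labels[np.argmax(counts)] == y[i]:
            return k
    return k_max

def fit_k_terrain(X, y):
    """Per-training-point k values; a regressor fitted on these approximates
    the k-terrain described in the paper."""
    k_max = max(1, int(np.sqrt(len(X))))
    return np.array([correctly_classifying_k(X, y, i, k_max) for i in range(len(X))])

def predict_k(X_train, k_values, query):
    """Crude point-specific k estimate: ceiling of the mean k of the 3 nearest
    training points (our stand-in for the MLP that learns the k-terrain)."""
    d = np.linalg.norm(X_train - query, axis=1)
    return int(np.ceil(k_values[np.argsort(d)[:3]].mean()))
```

Once a point-specific k is predicted for a query, classification proceeds with the ordinary kNN rule using that k.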
points. Furthermore, the algorithm suffers from a high time
II. BACKGROUND

In this section, we briefly describe the notable past works attempting to aid the kNN classifier by incorporating information about the neighborhood of a test point and making it suitable for handling class imbalance.

A. Sensitivity to Locality Information of Nearest Neighbor Classifiers

In the kNN classifier, the local density around a test point should ideally affect the choice of k (see [19, Fig. 1]), resulting in a better strategy for rightly classifying the test point. This idea is employed by Wettschereck and Dietterich [28] in four of their proposed methods to determine test-point-specific values of k. The first two methods utilize the information of the correctly classifying k values of the neighbors of a test point. The third technique works by finding the class-specific k values, which successfully classify a maximum number of training members from their corresponding classes. In the fourth method, the test points are clustered, and each of the clusters uses a single k value for classifying all test points belonging to that cluster.

Ougiaroglou et al. [29] proposed a dynamic kNN classifier by modifying the incremental nearest neighbor searching algorithm [30]. Their method introduced three heuristics to reduce the computational complexity of the incremental kNN search. Wang et al. [15] incorporated an adaptive distance measure with kNN, forming an algorithm called AkNN. The method finds, for each training point xi ∈ P, the distance r_xi of the nearest neighbor belonging to any other class. While finding the neighbors for a query point yi ∈ Q, the distance between yi and xi is downscaled by the corresponding r_xi.

Bhattacharya et al. [31] devised an algorithm for assigning a test-point-specific value of k. The method is based on neighborhood density [see [31, eq. (1)]] and certainty factor [see [31, eqs. (5) and (6)]]. Tang and He [16] proposed a variant of the classical kNN decision rule known as the ENN method. The key feature of their algorithm is the consideration of not only the k neighboring training points of the test point in question, but also those training points which include the query point in their k neighborhoods. The authors iteratively assign a test point to each of the possible classes and calculate a class-specific statistic [see [16, eqs. (4) and (5)]]. That test point is then assigned to the class having the highest value of the statistic.

Another work by Garcia-Pedrajas et al. [17] (hereafter called dyn-kNN) approached the task of a test-point-specific k value by considering both the global and local implications of each choice of k. The algorithm evaluates the global performance for all k ∈ [kmin, kmax]. A training-point-specific local performance for each of the k values is also maintained.

Anava and Levy [19] proposed k∗NN to estimate the class label of a test point yi ∈ Q as a weighted average of the class labels of training points. They showed that for yi, only a test-point-dependent finite number of nearest neighbors will contribute to the calculation of its class label. The problem of finding the point-specific number of neighbors and their corresponding weights can be reduced to a convex optimization problem, which can be solved exactly by the greedy method proposed by the authors.

Ezghari et al. [18] introduced a new adaptive variant of fuzzy kNN called FABC, which first transforms the data points from their original numerical space to a fuzzy membership space. A query point is classified by aggregating its similarities with each of the training points and their respective fuzzy membership values for the different classes.

B. kNN Classifiers for Handling Class Imbalance

The classification accuracy in the presence of class imbalance can be improved in one of two directions. In the first direction, the original training set is kept unaltered, while the classifier is modified to incorporate immunity against the effect of imbalance. Along the second line of thinking, the focus is shifted from the classifier to the training set itself, which is properly pre-processed by either under-sampling (the majority class) or over-sampling (the minority class) [10], [32]–[34].

1) kNN Variants Designed Specifically for the Class Imbalance Problem: As a primary approach, Zhang and Mani [35] used five different methods to under-sample the majority class such that the effect of class imbalance can be alleviated.

Tan [24] proposed a weight selection scheme for the weighted kNN method (called NWkNN) to handle imbalance. In this technique, each class is assigned a weight depending on its global probability and that of the smallest class. However, the algorithm introduces an additional data-dependent parameter, which needs to be optimized.

Wang et al. [36] modified the extended kNN proposed by Wang and Bell [37] to make it suitable for handling imbalance. The authors defined class-specific weights in the form of differences between the local (calculated over the k neighborhood) and global (calculated over P) probabilities, respectively, for each of the classes.

Liu and Chawla [38] developed a weight selection technique for handling class imbalance. They reduced the problem of kNN to finding the maximum among the probabilities of a test point yi belonging to each of the classes, given the k neighborhood of yi. To achieve this, the authors used the posterior probabilities of the classes as weights (estimated by mixture models or a Bayesian network). Li and Zhang [21] presented an idea (called kENN) of extending some of the
4 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
minority class members to Gaussian balls, to compensate for the disadvantage of scarcity. Such a strategy effectively reduces the distance between a test point and the minority class members so that the minority class gets an edge.

Dubei and Pudi [25] utilized a weighting scheme to aid the issue of rare class classification (hereafter called WkNN). Their method incorporates a query-specific local weighting alongside a set of global class-specific weights. However, the method needs to employ additional kNN runs over the neighborhood of each query point. Kriminger et al. [23] used the nearest neighbor information to estimate the probability of a class in their algorithm called CCNND. For each class, the algorithm finds the probability of a test point belonging to that class given the distance of its nearest neighbor from that class (using the empirical cumulative distribution function of the intraclass nearest neighbor distances). Zhang and Li [22] proposed a variant of kNN called kPNN. The idea uses a dynamic neighborhood (containing at least k minority class members) and classifies based on the estimation of the locally adjusted posterior probability of the minority class.

Recently, Zhang et al. [20] proposed kRNN, which uses a further constrained definition of the dynamic neighborhood proposed in [22] for each test point. Furthermore, the classifier estimates both the globally and locally adjusted posterior probabilities for the minority class.

However, the problems with all these methods are the involvement of exhaustive search, the introduction of new parameters, and significant computational overhead, which hinder the scalability and easy implementation of kNN.

2) Removing Imbalance by Pre-Processing the Data Set: A data set can be pre-processed by under-sampling the points of the majority classes or over-sampling the points of the minority classes such that the number of representatives from all the classes becomes comparable. A common method of under-sampling is called Random Under-Sampling (RUS) [26], which, as the name suggests, randomly discards the representatives of the majority class from the training set until the effect of imbalance is sufficiently mitigated, whereas a notable approach for over-sampling known as the Synthetic Minority Over-Sampling Technique (SMOTE) [27] randomly creates new minority class points by interpolating between the existing minority points and their neighbors. However, the original training set characteristics are not entirely preserved through either of these techniques. In addition, both methods introduce new parameters, which need to be properly tuned.

III. PROPOSED METHODS

Starting with some basic notations which we follow throughout this paper, we first describe the canonical and weighted kNN classification algorithms. Next, we propose the Ada-kNN classifier and extend it to Ada-kNN2 using a heuristic learning algorithm for point-specific k selection. Thereafter, we present GIHS and use it to augment Ada-kNN and Ada-kNN2 for imbalanced classification problems. Finally, we finish the section with the computational complexity analysis of the proposed algorithms.

A. kNN and the Class-Specific Weighted kNN Classifiers

Let the class label be denoted as c(xi) ∈ C for a training point xi ∈ P, for i = 1, 2, . . . , n. The set of all training points belonging to class c is denoted by Cc, and |Cc| = nc. The kNNs of a data point x ∈ X, from a set S, form the set S_k(x). Also, let us take I(.) to be an indicator function defined as in (1), where cond is some condition to be satisfied:

I(cond) = { 1, if cond is true
            0, if cond is false.    (1)

We can now describe the basic kNN classification rule with the help of these notations. For a test point yi, if the kNNs of yi in the training set P form the set P_k(yi), then the predicted class label of yi, denoted as ĉ(yi), will be

ĉ(yi) = arg max_{c ∈ {1, 2, ..., C}} Σ_{xi ∈ P_k(yi)} I(c(xi) = c).    (2)

As indicated by (2), the kNN classifier actually labels a new point yi as belonging to the class having the maximum number of members in the set P_k(yi). In the class-specific weighted kNN classifier [24], every class is associated with a weight, and the number of points belonging to a certain class is multiplied by the weight associated with that class. The rest is similar to the conventional kNN classification, resulting in the assignment of the new point to the class producing the maximum weighted number of members in P_k(yi). If the weight for the cth class is denoted by wc, then the weighted kNN classification rule can be expressed as

ĉ(yi) = arg max_{c ∈ {1, 2, ..., C}} wc Σ_{xi ∈ P_k(yi)} I(c(xi) = c).    (3)

The weights in (3) can have any real value greater than 0. The weighted kNN classifier is actually a more general form of the kNN classifier, which can be reduced to the basic kNN classifier by making all the weights equal. In weighted kNN classification, the user enjoys the flexibility to introduce problem-specific class priorities, which can achieve better results.

B. Description of Ada-kNN

The Ada-kNN classifier utilizes the information of the neighborhood of a test point and mitigates the problem of a global choice of k by introducing the concept of a point-specific k value. One way to do this is by finding the sets of k values correctly classifying each of the training points xi ∈ P and employing a supervised learner to model the k values as a function of their corresponding training vectors. This function can then be applied to a test point to identify a k value suitable for it.

The Ada-kNN algorithm starts by finding the set of successful k values, K_xi, for each of the training points xi ∈ P. One may perform an exhaustive search over the set of all possible k values, i.e., K = {1, 2, . . . , kmax}, to find the set K_xi ∀ xi ∈ P. Here, kmax = √n, which satisfies the following desirable properties.
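Rules (2) and (3) translate directly into code. Below is a minimal sketch (helper names ours); with all weights equal, it reduces to the basic kNN rule.

```python
import numpy as np

def weighted_knn_label(X_train, y_train, query, k, weights=None):
    """Implements rule (3); with all weights equal to 1 it is plain rule (2)."""
    classes = np.unique(y_train)
    if weights is None:                      # unweighted case, rule (2)
        weights = {c: 1.0 for c in classes}
    d = np.linalg.norm(X_train - query, axis=1)
    neighbors = y_train[np.argsort(d)[:k]]   # labels of the set P_k(y_i)
    # arg max over classes of w_c * (number of neighbors with label c)
    scores = {c: weights[c] * np.sum(neighbors == c) for c in classes}
    return max(scores, key=scores.get)
```

Raising a class's weight above 1 inflates its neighbor count, which is exactly how the class-specific weighting counters a scarce minority class.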
α elements of Krand and record all the correctly classifying k values in a set K_xi (K_xi ⊆ K). Only when all the α trials are a failure do we look for the first correctly classifying value of k among the subsequent members of Krand. At the time of training the MLP, one can randomly choose a value of k_xi ∈ K_xi.

In the test phase of Ada-kNN, the trained MLP can be fed with a query point yi ∈ Q to obtain the value of k_yi.

C. New Learning Method and Ada-kNN2

We use an MLP in the primary implementation of Ada-kNN to model the nonlinear function of correctly classifying k values of the training points. However, a learning algorithm like MLP involves a number of parameters (the number of hidden layers, the number of hidden nodes in each layer, the learning rate, and a choice of the training algorithm). Moreover, MLP incurs a high algorithmic complexity. Both of these issues affect the simplicity of Ada-kNN. Furthermore, the MLP uses a single k_xi ∈ K_xi, while the set may contain at most α elements, reducing the utilization of the available information. This motivated us to design a simple parameter-free heuristic learning algorithm.

The basic idea of the proposed heuristic scheme is based on the assumption that a test point will have similar properties as its training set neighbors. Let us consider a k value which correctly classifies some of the training set neighbors of a test instance yi ∈ Q. Such a k value can then be expected to correctly predict the class label of yi with a certain probability. The heuristic attempts to find the value of k for which such a probability is maximized over the neighborhood of yi.

To define the problem more formally, let us assume that the neighborhood of yi contains ν number of points. Furthermore, consider a family of functions F = {f1, f2, . . . , fkmax}. Each of the fk ∈ F is a many-to-one mapping from a data set X to

(yi). A random mode can be selected if the multiset is multimodal.

Now, we need to find a suitable value for ν. It is reasonable to assume that a test point lying in a dense region has similar characteristics with a larger number of training set neighbors, whereas a test point residing in a sparse location has similarities with only a small number of neighbors. Therefore, the number of neighbors should be varied according to the local density in the vicinity of the test point. Hence, ν (hereafter called ν_yi for a test point yi ∈ Q) should be a test-point-dependent parameter instead of a global constant. A direct way to find an optimal value of ν_yi is to use the density of the locality surrounding yi. However, instead of calculating or approximating the actual local density, one may use an easily computable indicator of it, such as the nearest neighbor distance. This claim is supported by the following theorem.

Theorem 1: Let a D-dimensional hypersphere contain some uniformly distributed points (including one lying at the center). Let the expected distances of the nearest neighbor from the center be dN1 and dN2, respectively, when the hypersphere contains N1 and N2 number of points. If N1 > N2, then it can be shown that dN1 < dN2.

Proof: The proof follows from the derivation of Bhattacharyya and Chakrabarti (see [39, Sec. III]), which we detail in the Supplementary Material.

From Theorem 1, it can be seen that the local density bears an approximately inverse relation to the expected nearest neighbor distance. Therefore, if the nearest neighbor distance is large, we can assume a sparse locality, whereas a low distance will indicate a dense region. Let us denote the distance from a point x to its nearest neighbor in a set S as d^S_1NN(x). Therefore, for a point yi ∈ Q, the nearest training set neighbor is at a distance d^P_1NN(yi). Let us denote dmin = min_{xi ∈ P} d^{P\{xi}}_1NN(xi) and dmax = max_{xi ∈ P} d^{P\{xi}}_1NN(xi).
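Theorem 1 can also be checked numerically. The Monte Carlo sketch below (ours, in two dimensions) estimates the expected center-to-nearest-point distance in a unit disk for two population sizes.

```python
import numpy as np

def mean_nn_distance_from_center(num_points, trials=2000, seed=0):
    """Monte Carlo estimate of the expected distance from the center of a
    unit disk to the nearest of `num_points` uniformly distributed points.
    Only the radii matter, so angles are never sampled; r = sqrt(u) gives
    a uniform distribution over the disk."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        r = np.sqrt(rng.random(num_points))
        total += r.min()
    return total / trials

d_dense = mean_nn_distance_from_center(100)   # N1 = 100 points
d_sparse = mean_nn_distance_from_center(10)   # N2 = 10 points
```

With N1 = 100 > N2 = 10, the denser disk yields the smaller expected nearest neighbor distance, as the theorem predicts.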
LIST OF K_xi AND RANDOMLY SELECTED k_xi FROM K_xi, FOR EACH OF THE TRAINING POINTS xi ∈ P IN THE EXAMPLE

ν_lin(yi) = ((1 − ν_max)/(d_max − d_min)) (d_max − d^P_1NN(yi)) + ν_max,    (8)

having

β = −(log_e ν_max)/(d_max − d_min).    (10)
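Taking the linear mapping of (8) at face value — ν interpolated between 1 at d_min and ν_max at d_max, which reproduces the paper's worked example (ν2 = 1 at distance 0.1, ν5 = 2 at distance 0.2) — the neighborhood size can be computed as follows. This is a sketch under that assumption; the function name and the value ν_max = 9 used below are ours.

```python
import numpy as np

def nu_lin(d_query, d_min, d_max, nu_max):
    """Map a query's nearest-neighbor distance to a neighborhood size,
    interpolating linearly between 1 at d_min and nu_max at d_max.
    Algebraically the same line as (8), rearranged for numerical stability;
    distances outside [d_min, d_max] are clipped."""
    d = float(np.clip(d_query, d_min, d_max))
    t = (d - d_min) / (d_max - d_min)   # 0 at d_min, 1 at d_max
    return int(np.ceil(1.0 + (nu_max - 1.0) * t))
```

A query sitting at the minimum observed nearest-neighbor distance thus consults a single neighbor, while one at the maximum distance consults ν_max of them.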
Fig. 2. Training points and their corresponding values of k from the example, illustrating the estimated k-terrain and predicted k values for the test points.
point, the optimal k value cannot be directly estimated from it. However, this information can be used to train an MLP to approximate the nonlinear function of k (referred to as the k-terrain). As the MLP uses a random k_xi ∈ K_xi for a training point xi, the learned k-terrain is not unique. One such possible k-terrain is shown in Fig. 2. To illustrate the efficiency of our proposed classifiers, we take a test set Q = {2, 5}, where c(2) = 1 and c(5) = 2. If we use Ada-kNN, then we take the ceiling of the k-terrain values corresponding to the test points. Following this process, one may obtain k2 = 1 and k5 = 3. In the case of Ada-kNN2, we find dmin = 0.1, dmax = 0.9, d^P_1NN(2) = 0.1, and d^P_1NN(5) = 0.2. Thus, ν2 = 1 and ν5 = 2, which gives K^ν2(2) = {1} and K^ν5(5) = {1, 2, 3, 3, 4}. Taking the modes of K^ν2(2) and K^ν5(5) gives k2 = 1 and k5 = 3, respectively, similar to that of Ada-kNN. One can verify using P and (2) that k2 and k5 can, indeed, correctly classify the corresponding test points.

E. Handling Imbalance in a Data Set

As discussed in Section I-B, the class-specific weighted kNN classifier can prove to be efficient in situations plagued by class imbalance. The class-specific weights can be used to magnify the increments in the number of members from the minority classes in the neighborhood of a test point while diminishing those of the majority classes to compensate for their abundance.

One way to weigh the classes is to use a global class-weighting scheme, which uses the same set of class weights for all test points. Since there should ideally be an equal number of representatives from each class in P, the ideal probability of a point belonging to class c ∈ C should be r = (1/C). But most data sets are not ideal, and thus, we have nc training points belonging to the class c (where the nc values are not all comparable). Hence, Σ_{c=1}^{C} nc = n, and the prior probability of class c is pc = (nc/n). In order for each class to have a fair chance in the neighborhood of a test point irrespective of class imbalance, we assign the ratio of the ideal and current probabilities for a class c as the global weight wc associated with that class, i.e.,

wc = r/pc.    (11)

F. Complete Algorithm

The complete algorithms for Ada-kNN and Ada-kNN2, with the addition of GIHS, are schematically shown, respectively, in Algorithms 1 and 2. If we do not wish to consider data imbalance, we can skip employing GIHS by initializing the weight vector W with all 1s, in which case the two algorithms reduce to regular Ada-kNN and Ada-kNN2. Both of the algorithms use a subroutine called TrainRoutine(P, W, K, xi), which is described in Procedure 1, for finding the set K_xi.

Hereafter, we will denote the ith member of a set S as Si. The function Concat(si | i = 1, 2, . . . , n) takes values s1, s2, . . . , sn and concatenates them to form a multiset. The subroutine Mode(S) takes a multiset S as input and produces a mode of S as output (resolving ties randomly in case S is multimodal). The function Rand(S) takes a set S as input and outputs a random element of S, while RandPerm(S) returns a random permutation of the elements of set S.

Procedure 1 TrainRoutine(P, W, K, xi)
  Initialize: Krand ← RandPerm(K), K_xi ← φ.
  K_xi ← all k among the first α elements of Krand that successfully classify xi using the training set P \ {xi}.
  if K_xi = φ then
    K_xi ← first k among the rest of the elements of Krand that successfully classifies xi using the training set P \ {xi}.
  end if
  return K_xi

G. Algorithmic Complexity of Ada-kNN and Its Variants

In this section, we analyze the computational complexities of the proposed classifiers. In Theorem 2, we analyze the asymptotic complexity of the regular kNN classifier, which we then utilize to examine the complexities of Ada-kNN, Ada-kNN2, and their GIHS-coupled versions in Theorems 3–5, respectively. Proofs of Theorems 2–5 can be found in the Supplementary Material.

Theorem 2: The asymptotic time complexity for classifying a set of points by the kNN classifier is O(n³).

Theorem 3: The asymptotic time complexity for classifying a set of points by the Ada-kNN classifier is O(max{n^2.5, nψ}), where the supervised k learner takes O(nψ) time.
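Procedure 1 and the GIHS weights of (11) are straightforward to express in code. The sketch below is an illustration rather than the authors' implementation; the plain-kNN helper and all names are ours.

```python
import random
import numpy as np

def knn_label(X, y, query, k):
    """Majority label among the k nearest points of X (plain rule (2))."""
    order = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
    labels, counts = np.unique(y[order], return_counts=True)
    return labels[np.argmax(counts)]

def train_routine(X, y, i, K, alpha=10, seed=None):
    """Procedure 1: collect the k values among the first alpha entries of a
    random permutation of K that classify training point i correctly using
    P \\ {x_i}; if all alpha trials fail, fall back to the first success
    among the remaining entries."""
    rng = random.Random(seed)
    K_rand = list(K)
    rng.shuffle(K_rand)
    mask = np.arange(len(X)) != i
    ok = lambda k: knn_label(X[mask], y[mask], X[i], k) == y[i]
    K_xi = [k for k in K_rand[:alpha] if ok(k)]
    if not K_xi:  # all alpha trials were a failure
        K_xi = next(([k] for k in K_rand[alpha:] if ok(k)), [])
    return K_xi

def gihs_weights(y):
    """Eq. (11): w_c = r / p_c with r = 1/C and p_c = n_c / n."""
    classes, counts = np.unique(y, return_counts=True)
    r = 1.0 / len(classes)
    return {c: r / (nc / len(y)) for c, nc in zip(classes, counts)}
```

Note how the GIHS weights exceed 1 exactly for classes rarer than the ideal 1/C share, boosting their neighbor counts in rule (3).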
TABLE IV
COMPARISON OF CLASSIFIERS IN TERMS OF GM ON TWO-CLASS IMBALANCED DATA SETS
TABLE V
COMPARISON OF CLASSIFIERS IN TERMS OF AUROC ON TWO-CLASS IMBALANCED DATA SETS
with tunable parameters, such as SMOTE or RUS) for each data set differs significantly from the competitors with at least a 5% level of significance. The result for this test is summarized (detailed in the Supplementary Material) in terms of Wins (W), Ties (T), and Losses (L). No statistical test was undertaken while comparing the performance of Ada-kNN and Ada-kNN2 with k∗NN and FABC, as the results for the latter contenders are, respectively, quoted from [18] and [19].

D. Choice of Parameter α for Ada-kNN and Ada-kNN2

A suitably large choice of α will produce a good coverage of K_xi for a point xi ∈ P, but will also increase the computational complexity. A small value of α, on the other hand, will reduce the computational complexity, but may increase the chance of overfitting by decreasing the noise immunity. Furthermore, we observed during experiments that tuning the value of α does not significantly improve the results, suggesting very little data dependence. These factors lead us to take a constant value of α = 10 for all our experiments.

E. Performance of Ada-kNN and Ada-kNN2 Compared With Other State-of-the-Art Classifiers in Terms of acc

We summarize the results of Ada-kNN, Ada-kNN2, and the eight other competing algorithms over 17 (11 small/medium-scale and 6 large-scale) data sets in terms of acc in Table II. An inspection of the results on the small/medium-scale data sets reveals that Ada-kNN2 achieved the best mean acc and ARSMS. The Ada-kNN algorithm managed to attain the fifth place both in terms of average acc and ARSMS. However, its performance is statistically comparable to that of Ada-kNN2 on 10 data sets.

In the case of the large-scale data sets, Ada-kNN2 achieved the second best average acc, slightly lower than that of MLP-SCG. However, Ada-kNN2 achieved the best ARLS, which confirms its better consistency and scalability. Ada-kNN, on the other hand, exhibits lower competence on the large-scale data sets, as is evident from the high ARLS and low average acc. Ada-kNN failed to achieve good performance on the large-scale data sets, possibly due to insufficient learning of the k-terrain by the underlying MLP, leading to erroneous prediction of the k_yi values.

The comparison between Ada-kNN, Ada-kNN2, and FABC (over eight data sets) is summarized in the first three rows of Table III, and it shows that Ada-kNN2 achieved the best AR. Ada-kNN and FABC both achieved the same AR, with Ada-kNN attaining the higher average acc. The comparison between the two proposed classifiers and k∗NN (on six data sets) is summarized in the last three rows of Table III (the original study introducing k∗NN [19] used misclassification
MULLICK et al.: ADAPTIVE LEARNING-BASED K NN CLASSIFIERS WITH RESILIENCE TO CLASS IMBALANCE 11
TABLE VI
COMPARISON OF CLASSIFIERS IN TERMS OF AURPC ON TWO-CLASS IMBALANCED DATA SETS
TABLE VII
COMPARISON OF CLASSIFIERS IN TERMS OF GM ON MULTI-CLASS IMBALANCED DATA SETS
TABLE VIII
COMPARISON OF CLASSIFIERS IN TERMS OF AUROC ON MULTI-CLASS IMBALANCED DATA SETS
error as index, which we convert to acc for the sake of consistency). Ada-kNN achieved both the best average acc and rank in this case, followed by Ada-kNN2.

These findings imply that the proposed classifiers outperform the state of the art on small/medium-scale data sets due to the adaptive choice of k values. Ada-kNN2 is also able to retain the good performance on the large-scale data sets, unlike Ada-kNN, which seems to suffer from scalability issues.

F. Performance of Ada-kNN + GIHS and Ada-kNN2 + GIHS on Two-Class Imbalanced Data Sets in Terms of GM, AUROC, and AURPC

In Tables IV–VI, we summarize the results in terms of GM, AUROC, and AURPC for 15 small/medium and 7 large-scale two-class imbalanced data sets. An inspection of the results for the small/medium-scale data sets reveals that Ada-kNN2 + GIHS, Ada-kNN + GIHS, and kPNN outperformed the other contenders. Ada-kNN2 + GIHS achieved the best average index value as well as ARSMS for both GM and AUROC. kPNN achieved a slightly better index value and ARSMS compared with Ada-kNN + GIHS for AURPC. However, the W-T-L counts for AURPC point toward the greater consistency of Ada-kNN2 + GIHS. Therefore, the Ada-kNN2 + GIHS classifier seems to be the best choice for this type of data sets.

On the seven high-dimensional data sets, Ada-kNN2 + GIHS, kNN + GIHS, and kPNN achieved relatively better performance than the other competitors. Though kPNN achieved better average index values, the lower ARLS attained by Ada-kNN2 + GIHS on GM as well as AUROC implies the greater consistency of the latter. The W-T-L counts,
TABLE IX
COMPARISON OF CLASSIFIERS IN TERMS OF AURPC ON MULTI-CLASS IMBALANCED DATA SETS
which report statistically comparable AURPC performance for Ada-kNN2 + GIHS against kNN + GIHS as well as kPNN, further attest to this fact. The slightly poorer performance of Ada-kNN2 + GIHS may be due to the increasing bias of the index toward the accuracy on the majority class with rising IR, as shown by an example in the Supplementary Material. Hence, we conclude that Ada-kNN2 + GIHS remains competitive with the state of the art on this type of data sets as well.

Thus, Ada-kNN2 + GIHS can outperform the state-of-the-art imbalanced classifiers over two-class imbalanced data sets of varying scale, owing to the adaptive choice of k augmented with GIHS.

G. Performance of Ada-kNN + GIHS and Ada-kNN2 + GIHS on Multi-Class Imbalanced Data Sets in Terms of GM, AUROC, and AURPC

We report the summary of results in terms of GM, AUROC, and AURPC for the 12 contending classifiers on six small/medium and five large-scale multi-class imbalanced data sets in Tables VII–IX. On small/medium-scale data sets, Ada-kNN2 + GIHS and kNN + GIHS emerge as the top performers. Ada-kNN2 + GIHS achieved better index values as well as ARSMS for all three performance evaluating indices. Among the contenders, kNN + GIHS demonstrated the second best performance in terms of both the ARSMS and the average index value for all the indices. Therefore, Ada-kNN2 + GIHS is found to be a good choice for this type of data sets.

Ada-kNN2 + GIHS and kNN + GIHS performed better than the other contenders in terms of GM and AUROC for the large-scale data sets. This indicates that both of these methods were able to achieve equivalent classwise accuracies for all the classes present in a data set. However, the performance of Ada-kNN2 + GIHS in terms of AURPC lags behind those of other methods, such as kNN, NWkNN, and dyn-kNN, due to the bias of the index as explained for the two-class imbalanced case. Therefore, Ada-kNN2 + GIHS should be the method of choice for this type of data sets. Furthermore, the kNN + GIHS classifier is able to retain the good performance in terms of AURPC as well. This points toward the efficacy of the simple, parameter-free GIHS proposal.

Hence, one should employ Ada-kNN2 + GIHS for multi-class imbalanced data sets of varying scale, which further indicates the usefulness of the adaptive choice of k and the proposed weighting scheme.

V. CONCLUSION AND FUTURE WORK

We now summarize the key findings of this paper and present our concluding remarks. The proposed adaptive choice of k values for the kNN classifier can prove to be very useful for data classification over the conventional global choices (such as k = 1). This is because such an adaptive choice of k can account for the properties of the neighborhood of the test point, which may remain ignored by a single global choice of k. The advantage of such an approach is evident from the consistently commendable performance of Ada-kNN and Ada-kNN2. GIHS is a simple weighting scheme incurring constant time complexity, which does not involve any additional parameters. Furthermore, in contrast to the preprocessing techniques used for imbalance handling, GIHS has the additional advantage of preserving the class distributions of the data set. GIHS is observed to further augment the performance of Ada-kNN and Ada-kNN2 in the presence of class imbalance.

The MLP-based proposal (Ada-kNN) for adaptively choosing k values seems to suffer from scalability issues on large-scale data sets. On the other hand, the proposed heuristic counterpart (Ada-kNN2) is immune to such scalability issues, as is evident from its admirable performance on small- as well as large-scale data sets. Moreover, the improved performance of Ada-kNN2 + GIHS is observed to be resilient to the increase in IR (see the Supplementary Material). An interesting future extension of this paper may be to replace the MLP in Ada-kNN with a more scalable supervised learning algorithm. Another future direction would be to investigate how different distance measures affect the performance of Ada-kNN and Ada-kNN2. Specifically, it may be interesting to study the effects (of the choice of distance measure) in the case of high-dimensional data sets, where Euclidean distance is known to perform poorly [53].

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Hoboken, NJ, USA: Wiley, 2000.
[2] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967.
[3] P. Hall, B. U. Park, and R. J. Samworth, “Choice of neighbor order in nearest-neighbor classification,” Ann. Statist., vol. 36, no. 5, pp. 2135–2152, 2008.
[4] D. O. Loftsgaarden and C. P. Quesenberry, “A nonparametric estimate of a multivariate density function,” Ann. Math. Statist., vol. 36, no. 3, pp. 1049–1051, 1965.
[5] C.-L. Liu and M. Nakagawa, “Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition,” Pattern Recognit., vol. 34, no. 3, pp. 601–615, 2001.
[6] C. Domeniconi, J. Peng, and D. Gunopulos, “Locally adaptive metric nearest-neighbor classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1281–1285, Sep. 2002.
[7] G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, “An affinity-based new local distance function and similarity measure for kNN algorithm,” Pattern Recognit. Lett., vol. 33, no. 3, pp. 356–363, 2012.
[8] C. C. Holmes and N. M. Adams, “A probabilistic nearest neighbour method for statistical pattern recognition,” J. Roy. Statist. Soc. Ser. B, Statist. Methodol., vol. 64, no. 2, pp. 295–306, 2002.
[9] A. K. Ghosh, “On optimum choice of k in nearest neighbor classification,” Comput. Statist. Data Anal., vol. 50, no. 11, pp. 3113–3123, 2006.
[10] Y. Sun, A. K. Wong, and M. S. Kamel, “Classification of imbalanced data: A review,” Int. J. Pattern Recognit. Artif. Intell., vol. 23, no. 4, pp. 687–719, 2009.
[11] M. F. Møller, “A scaled conjugate gradient algorithm for fast supervised learning,” Neural Netw., vol. 6, no. 4, pp. 525–533, Nov. 1993.
[12] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann, 1993.
[15] J. Wang, P. Neskovic, and L. N. Cooper, “Improving nearest neighbor rule with a simple adaptive distance measure,” Pattern Recognit. Lett., vol. 28, no. 2, pp. 207–213, 2007.
[16] B. Tang and H. He, “ENN: Extended nearest neighbor method for pattern recognition,” IEEE Comput. Intell. Mag., vol. 10, no. 3, pp. 52–60, Aug. 2015.
[17] N. García-Pedrajas, J. A. D. Castillo, and G. Cerruela-Garcia, “A proposal for local k values for k-nearest neighbor rule,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 2, pp. 470–475, Feb. 2015.
[18] S. Ezghari, A. Zahi, and K. Zenkouar, “A new nearest neighbor classification method based on fuzzy set theory and aggregation operators,” Expert Syst. Appl., vol. 80, pp. 58–74, Sep. 2017.
[19] O. Anava and K. Levy, “k∗-nearest neighbors: From global to local,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4916–4924.
[20] X. Zhang, Y. Li, R. Kotagiri, L. Wu, Z. Tari, and M. Cheriet, “KRNN: K rare-class nearest neighbour classification,” Pattern Recognit., vol. 62, pp. 33–44, Feb. 2017.
[21] Y. Li and X. Zhang, “Improving k nearest neighbor with exemplar generalization for imbalanced classification,” in Proc. 15th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2011, pp. 321–332.
[22] X. Zhang and Y. Li, “A positive-biased nearest neighbour algorithm for imbalanced classification,” in Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2013, pp. 293–304.
[23] E. Kriminger, J. C. Príncipe, and C. Lakshminarayan, “Nearest neighbor distributions for imbalanced classification,” in Proc. Int. Joint Conf. Neural Netw., 2012, pp. 1–5.
[24] S. Tan, “Neighbor-weighted k-nearest neighbor for unbalanced text corpus,” Expert Syst. Appl., vol. 28, no. 4, pp. 667–671, 2005.
[25] H. Dubey and V. Pudi, “Class based weighted k-nearest neighbor over imbalance dataset,” in Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2013, pp. 305–316.
[26] N. Japkowicz, “The class imbalance problem: Significance and strategies,” in Proc. Int. Conf. Artif. Intell. (ICAI), 2000, pp. 111–117.
[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[28] D. Wettschereck and T. G. Dietterich, “Locally adaptive nearest neighbor algorithms,” in Proc. Adv. Neural Inf. Process. Syst., 1994, pp. 184–191.
[29] S. Ougiaroglou, A. Nanopoulos, A. N. Papadopoulos, Y. Manolopoulos, and T. Welzer-Druzovec, “Adaptive k-nearest-neighbor classification using a dynamic number of nearest neighbors,” in Advances in Databases and Information Systems. Berlin, Germany: Springer, 2007, pp. 66–82.
[30] G. Hjaltason and H. Samet, “Distance browsing in spatial databases,” ACM Trans. Database Syst., vol. 24, no. 2, pp. 265–318, 1999.
[31] G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, “A probabilistic framework for dynamic k estimation in kNN classifiers with certainty factor,” in Proc. 8th Int. Conf. Adv. Pattern Recognit., Jan. 2015, pp. 1–5.
[32] P. Branco, L. Torgo, and R. P. Ribeiro, “A survey of predictive modeling on imbalanced domains,” ACM Comput. Surv., vol. 49, no. 2, pp. 31:1–31:50, 2016.
[33] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, “An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics,” Inf. Sci., vol. 250, pp. 113–141, Nov. 2013.
[34] B. Krawczyk, “Learning from imbalanced data: Open challenges and future directions,” Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, 2016.
[35] J. Zhang and I. Mani, “kNN approach to unbalanced data distributions: A case study involving information extraction,” in Proc. ICML Workshop Learn. Imbalanced Datasets, 2003, pp. 1–7.
[36] L. Wang, L. Khan, and B. Thuraisingham, “An effective evidence theory based k-nearest neighbor (kNN) classification,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intell. Intell. Agent Technol., Dec. 2008, pp. 797–801.
[37] H. Wang and D. Bell, “Extended k-nearest neighbours based on evidence theory,” Comput. J., vol. 47, no. 6, pp. 662–672, 2003.
[38] W. Liu and S. Chawla, “Class confidence weighted kNN algorithms for imbalanced data sets,” in Proc. 15th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2011, pp. 345–356.
[39] P. Bhattacharyya and B. K. Chakrabarti, “The mean distance to the nth neighbour in a uniform distribution of random points: An application of probability theory,” Eur. J. Phys., vol. 29, no. 3, p. 639, 2008.
[40] M. Lichman. (2013). UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml
[41] G. Rätsch. (2001). IDA Benchmark Repository. [Online]. Available: http://ida.first.fhg.de/projects/bench/benchmarks.htm
[42] I. Triguero et al., “KEEL 3.0: An open source software for multi-stage analysis in data mining,” Int. J. Comput. Intell. Syst., vol. 10, no. 1, pp. 1238–1249, 2017.
[43] R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proc. Eur. Conf. Mach. Learn., 2004, pp. 39–50.
[44] I. Guyon. (2006). Datasets for the Agnostic Learning vs. Prior Knowledge Competition. [Online]. Available: http://www.agnostic.inf.ethz.ch/datasets.php
[45] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011.
[46] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” in Proc. 14th Int. Conf. Mach. Learn., 1997, pp. 179–186.
[47] M. A. Maloof, “Learning when data sets are imbalanced and when costs are unequal and unknown,” in Proc. Int. Conf. Mach. Learn., 2003, pp. 1–5.
[48] D. J. Hand and R. J. Till, “A simple generalisation of the area under the ROC curve for multiple class classification problems,” Mach. Learn., vol. 45, no. 2, pp. 171–186, 2001.
[49] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 233–240.
[50] T. Raeder, T. R. Hoens, and N. V. Chawla, “Consequences of variability in classifier performance estimates,” in Proc. IEEE 10th Int. Conf. Data Mining (ICDM), Dec. 2010, pp. 421–430.
[51] T. Raeder, G. Forman, and N. V. Chawla, “Learning from imbalanced data: Evaluation matters,” in Data Mining: Foundations and Intelligent Paradigms. Berlin, Germany: Springer, 2012, pp. 315–331.
[52] M. Hollander and D. A. Wolfe, Nonparametric Statistical Methods. Hoboken, NJ, USA: Wiley, 1999.
[53] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in Proc. 8th Int. Conf. Database Theory, London, U.K., 2001, pp. 420–434.