III. METHOD
We are interested in the problem of learning a weight function to enhance the performance of the
NN algorithm on imbalanced data sets. First, we define a novel and flexible weight function, and
then we propose a learning process to learn the parameters of that weight function.
J(\Theta) = \left( \prod_{g=1}^{G} \mathrm{ACC}_g \right)^{1/G}  (6)

where \Theta = \{C_1, C_2, B_1, B_2\} is the weight-function parameter set and G is the total number of classes
(usually G = 2). The ACC function on the gth class, i.e., the per-class accuracy, is defined as:
\mathrm{ACC}_g = \frac{1}{n_g} \sum_{x \in X_g} \operatorname{step}\!\left( \frac{d_W(x, p_{\neq})}{d_W(x, p_{=})} \right)  (7)

\operatorname{step}(z) = \begin{cases} 1 & z > 1 \\ 0 & \text{otherwise} \end{cases}  (8)

where n_g is the number of samples in the gth class, X_g is the set of those samples, d_W is the
weighted distance, and p_=, p_≠ are, respectively, the same-class and different-class nearest
neighbors of x. A gradient ascent procedure is proposed to maximize the J-function. This requires
J to be differentiable with respect to all parameters, while the step function (eq. 8) is not
differentiable. Therefore, the step function is approximated by a sigmoid function, defined as:
S_b(z) = \frac{1}{1 + e^{\,b(1 - z)}}  (9)

Using eq. (9), the ACC function becomes:

\mathrm{ACC}_g \approx \frac{1}{n_g} \sum_{x \in X_g} S_b\!\left( \frac{d_W(x, p_{\neq})}{d_W(x, p_{=})} \right)  (10)

and, substituting eq. (10) into eq. (6), the smoothed objective is:

J(\Theta) \approx \left( \prod_{g=1}^{G} \frac{1}{n_g} \sum_{x \in X_g} S_b\!\left( \frac{d_W(x, p_{\neq})}{d_W(x, p_{=})} \right) \right)^{1/G}  (11)
The b-parameter has a smoothing effect; if b is large, then S_b(z) is an accurate approximation of
step(z). The derivative of the sigmoid function, which will be needed throughout the paper, is
simple and can be expressed in terms of the sigmoid itself:

S'_b(z) = \frac{d S_b(z)}{dz} = b \, S_b(z) \left( 1 - S_b(z) \right)  (12)

The derivative of the sigmoid function acts as a windowing function that is maximized at z = 1;
the height of this window decreases as the b-parameter decreases (i.e., the derivative of the
sigmoid function approaches the Dirac delta function when the b-parameter is large).
Correspondingly, we can derive from eq. (6):
Eqs. (13)-(15) give the partial derivatives of the J-function with respect to the weight-function
parameters, obtained from eq. (6) by the chain rule together with eq. (12). Similar to eq. (13),
the derivatives of the J-function with respect to the other weight-function parameters can simply
be derived.
Based on the parameter derivatives, an iterative leave-one-out gradient ascent procedure is
proposed in figure 3 to learn the parameters of the weighting functions. During each iteration, one
sample of the training data is selected as test data and the rest as the train set. Then, the
parameters are updated using the delta rule:

\theta \leftarrow \theta + \eta \, \frac{\partial J}{\partial \theta}  (16)

where \theta is the desired parameter and \eta is a learning rate adjusted empirically.
Proposed Learning Algorithm
Input:  T: training set; η: learning rates; b: sigmoid slope; a: small constant.
Output: weight-function parameters C1, C2, B1, B2.

Initialize C1, C2, B1, B2; set C1_new = C1, C2_new = C2, B1_new = B1, B2_new = B2;
While (termination condition is not reached)
    For each sample x in T (leave-one-out)
        Find p= (same-class nearest neighbor) and p≠ (different-class nearest neighbor) of x in T \ {x};
        same = indexOf(p=); diff = indexOf(p≠);
        c= = classLabelOf(p=); c≠ = classLabelOf(p≠);
        For j = 1 to N
            Update C1_new, C2_new, B1_new, B2_new by the delta rule (eq. 16),
            using the derivatives of eqs. (13)-(15);
        End For
    End For
    Set C1 = C1_new, C2 = C2_new, B1 = B1_new, B2 = B2_new;
End While
Return C1, C2, B1, B2.
FIG. 3. Proposed learning algorithm.
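The loop of figure 3 can be sketched in code. Because the exact weight-function form (the per-feature C and B parameters) is defined earlier in the paper and not reproduced in this excerpt, the sketch below substitutes a single per-feature weight vector `w` and maximizes the sigmoid-smoothed distance-ratio objective by leave-one-out gradient ascent; `w`, `eta`, `b`, and the toy data are illustrative stand-ins, not the paper's implementation:

```python
import math

def d_w(x, y, w):
    # weighted Euclidean distance (a stand-in for the paper's weighted metric)
    return math.sqrt(sum(wj * (xv - yv) ** 2 for wj, xv, yv in zip(w, x, y)))

def sigmoid(z, b):
    # eq. (9): smooth approximation of the step function around z = 1
    return 1.0 / (1.0 + math.exp(b * (1.0 - z)))

def loo_gradient_ascent(X, y, eta=0.05, b=5.0, epochs=10, eps=1e-9):
    """Leave-one-out gradient ascent on per-feature weights (illustrative).
    Assumes both classes are present in X."""
    n_feat = len(X[0])
    w = [1.0] * n_feat
    for _ in range(epochs):
        w_new = list(w)
        for i, (xi, yi) in enumerate(zip(X, y)):
            # leave xi out; find same- and different-class nearest neighbors
            rest = [(X[j], y[j]) for j in range(len(X)) if j != i]
            same_pool = [p for p, c in rest if c == yi]
            if not same_pool:
                continue
            same = min(same_pool, key=lambda p: d_w(xi, p, w))
            diff = min((p for p, c in rest if c != yi), key=lambda p: d_w(xi, p, w))
            ds = d_w(xi, same, w) + eps
            dd = d_w(xi, diff, w) + eps
            R = dd / ds                       # ratio of eq. (7); correct if R > 1
            s = sigmoid(R, b)
            win = b * s * (1.0 - s)           # sigmoid derivative, eq. (12)
            for j in range(n_feat):
                # dR/dw_j by the chain rule on the weighted distance
                gs = (xi[j] - same[j]) ** 2 / (2.0 * ds)
                gd = (xi[j] - diff[j]) ** 2 / (2.0 * dd)
                grad = win * (gd / ds - dd * gs / ds ** 2)
                w_new[j] = max(w_new[j] + eta * grad, 0.0)  # delta rule, eq. (16)
        w = w_new
    return w

# toy data: feature 0 separates the classes, feature 1 is noise
X = [(0.0, 0.5), (0.1, 0.9), (0.2, 0.1), (1.0, 0.8), (1.1, 0.2), (0.9, 0.6)]
y = [0, 0, 0, 1, 1, 1]
w = loo_gradient_ascent(X, y)
print(w)
```

Each pass visits every training sample once as the held-out point, mirroring the leave-one-out structure of figure 3; updates are accumulated in `w_new` and committed at the end of the epoch, as in the figure's final "Set" step.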
IV. EXPERIMENTS
This section presents the experimental results and their associated analysis on the imbalanced
data sets. In the first experiment, we used 24 highly imbalanced data sets from the UCI machine
learning repository, in which positive instances make up no more than 10% of the whole data set.
The selected data sets differ in the number of samples, the number of features, and the degree of
class imbalance. The main characteristics of these data sets are summarized in Table II.
TABLE II. UCI IMBALANCED DATA SETS.

Data set         Instances  Features  + Instances  - Instances
Glass5 214 9 9 205
shuttle2vs4 129 9 6 123
abalone918 731 8 49 682
ecoli4 336 7 20 316
Glass4 214 9 13 201
ecoli034vs5 300 7 20 280
ecoli0146vs5 280 7 20 260
ecoli0147vs56 332 7 25 307
Glass2 214 9 17 197
glass0146vs2 205 9 17 188
ecoli01vs5 240 7 20 220
glass06vs5 108 9 9 99
ecoli0147vs2356 336 7 29 307
ecoli067vs5 220 7 20 200
Vowel0 988 13 90 898
ecoli0347vs56 257 7 25 232
ecoli0346vs5 205 7 20 185
glass04vs5 92 9 9 83
ecoli0267vs35 224 7 22 202
ecoli01vs235 244 7 24 220
ecoli046vs5 203 7 20 183
glass015vs2 172 9 17 155
ecoli067vs35 222 7 22 200
yeast2vs4 514 8 51 463
Table III shows the G-mean results obtained by applying the different methods to the data sets.
The 1-NN classifier is used for the data-oriented methods (the k-parameter is adjusted manually
for each method). To obtain accurate results, we used 10-times 10-fold cross validation, which
is common in evaluating classifier performance. Each time, the entire data set is divided into 10
blocks. Then, one block is selected as the test set and the rest are selected as the train set;
this process continues until every block has been selected as the test set. It is repeated 10
times; therefore, the final results are the average of 100 different experiments. For each data
set, the learning rate (η) and the b-parameter are selected empirically from a fixed set of
candidate values. To accomplish that, for each possible combination of η and b, we used 10-times
10-fold cross validation and chose the best combination in terms of the G-mean criterion.
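The evaluation protocol above (10-times 10-fold cross validation scored by G-mean) can be sketched as follows; the scalar 1-NN stand-in classifier and the toy data are illustrative, not the paper's setup:

```python
import random

def g_mean(y_true, y_pred):
    """Geometric mean of per-class accuracies over the classes present in y_true."""
    classes = sorted(set(y_true))
    prod = 1.0
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        prod *= sum(y_pred[i] == c for i in idx) / len(idx)
    return prod ** (1.0 / len(classes))

def one_nn_predict(train, test_x):
    # plain (unweighted) 1-NN on scalar features, used here only as a stand-in
    return min(train, key=lambda p: abs(p[0] - test_x))[1]

def ten_times_ten_fold(data, seed=0):
    """10-times 10-fold CV: average G-mean over 100 train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(10):                        # 10 repetitions
        d = data[:]
        rng.shuffle(d)
        folds = [d[i::10] for i in range(10)]  # 10 blocks
        for k in range(10):
            test = folds[k]
            train = [p for j, f in enumerate(folds) if j != k for p in f]
            y_true = [c for _, c in test]
            y_pred = [one_nn_predict(train, x) for x, _ in test]
            scores.append(g_mean(y_true, y_pred))
    return sum(scores) / len(scores)

# well-separated toy data: a perfect classifier should score 1.0
data = [(-1.0 - 0.1 * i, 0) for i in range(10)] + [(1.0 + 0.1 * i, 1) for i in range(10)]
print(ten_times_ten_fold(data))
```

The empirical grid search described in the text would wrap `ten_times_ten_fold` in an outer loop over the candidate (η, b) pairs and keep the pair with the highest average G-mean. A stratified split (keeping the class ratio per block) would be the more careful choice on highly imbalanced data; simple shuffling is used here for brevity.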
Also, the Friedman test, a nonparametric statistical method for testing whether all the algorithms
are equivalent over the various data sets, is used to find significant differences among the
results obtained by the studied methods [26]. To accomplish this, the average rank is calculated
using the Friedman test. For a specific data set, the algorithm that achieves the highest
performance measure value is ranked in the first position and its rank value is set to one. The
algorithm that achieves the second highest value is given a rank value of two, and so forth.
Finally, the average rank of each algorithm is computed for comparison.
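The average-rank computation can be sketched directly; assigning averaged ranks to ties is a common convention that the text does not specify, so that detail is an assumption:

```python
def average_ranks(scores_per_dataset):
    """scores_per_dataset: one list of algorithm scores per data set (higher is better).
    Returns the average rank of each algorithm (rank 1 = best)."""
    n_alg = len(scores_per_dataset[0])
    totals = [0.0] * n_alg
    for scores in scores_per_dataset:
        order = sorted(range(n_alg), key=lambda i: scores[i], reverse=True)
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            # group algorithms with identical scores
            while j + 1 < n_alg and scores[order[j + 1]] == scores[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1        # averaged rank over the tie group
            for k in range(i, j + 1):
                ranks[order[k]] = mean_rank
            i = j + 1
        for a in range(n_alg):
            totals[a] += ranks[a]
    return [t / len(scores_per_dataset) for t in totals]

# two data sets, three algorithms: the third wins both times
print(average_ranks([[70.0, 80.0, 90.0], [60.0, 85.0, 88.0]]))  # [3.0, 2.0, 1.0]
```

The Friedman statistic itself is then computed from these average ranks; only the ranking step is shown here.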
TABLE III. THE G-MEAN RESULTS OBTAINED BY 10-TIMES 10-FOLD CROSS VALIDATION.

Data Set          [11]   [12]   [16]   1NN    LPD    [22]   Proposed
Glass5            68.76  87.15  49.75  71.26  53.94  88.04  92.13
shuttle2vs4       99.57  83.38  60.00  90.00  96.44  100    100
abalone918        62.55  71.00  05.00  28.05  14.92  62.57  69.41
ecoli4            90.14  92.26  71.88  84.84  87.47  90.42  96.13
Glass4            88.47  81.40  43.65  77.87  73.57  88.71  95.15
ecoli034vs5       97.07  88.29  90.53  83.05  91.81  97.07  97.71
ecoli0146vs5      87.95  87.38  82.04  83.90  86.01  87.37  93.88
ecoli0147vs56     85.80  89.19  75.14  83.29  85.19  88.13  91.07
Glass2            39.05  51.12  00.00  33.40  06.69  64.06  67.35
glass0146vs2      41.98  55.81  00.00  35.54  06.37  56.04  71.08
ecoli01vs5        84.67  90.79  87.59  84.77  85.65  90.36  91.36
glass06vs5        87.95  79.18  49.49  89.05  59.10  88.94  92.40
ecoli0147vs2356   80.27  85.31  59.79  79.58  82.29  83.15  93.43
ecoli067vs5       84.23  83.19  62.25  79.02  82.37  86.65  89.45
Vowel0            100    94.21  97.66  100    97.95  100    96.97
ecoli0347vs56     83.97  85.20  71.56  83.37  85.26  86.90  90.51
ecoli0346vs5      87.81  84.23  77.74  83.29  85.39  90.02  89.14
glass04vs5        90.00  84.31  68.01  70.57  84.98  99.35  100
ecoli0267vs35     76.08  77.11  65.86  81.57  80.75  85.20  92.17
ecoli01vs235      82.76  84.23  73.12  85.62  77.94  86.77  94.72
ecoli046vs5       88.00  87.22  76.95  85.37  84.52  87.51  91.80
glass015vs2       42.96  44.46  07.07  33.77  05.25  53.15  60.82
ecoli067vs35      68.09  79.99  45.42  78.54  75.04  79.43  82.16
yeast2vs4         85.38  87.46  66.16  81.63  76.67  90.87  91.01
Average Rank      3.93   3.66   6.66   4.95   5.08   2.37   1.31
For further examination, we performed another experiment on several different data sets taken
from the UCI, StatLib1 and agnostic vs. prior competition2 repositories. As mentioned in Ref.
[27], AUC-PR (the area under the precision-recall curve) is better than the area under the ROC
curve, since a curve dominates in ROC space if and only if it dominates in PR space. Therefore,
to obtain a better assessment, we used the more informative AUC-PR metric for classifier
comparisons. Tables IV and V, respectively, show the data set information and the results
obtained by 10-times 10-fold cross validation, as in the first experiment.

1 http://lib.stat.cmu.edu/
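AUC-PR can be estimated from ranked classifier scores; the sketch below uses the average-precision estimator (precision summed at each positive hit), one of several estimators discussed in the context of Ref. [27], and assumes at least one positive example:

```python
def auc_pr(y_true, scores):
    """Average-precision estimate of the area under the precision-recall curve.
    y_true: 1 for the positive (minority) class, 0 otherwise; assumes >= 1 positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank        # precision at this recall step
    return ap / n_pos

# a perfect ranking puts both positives first
print(auc_pr([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
```

Unlike ROC-AUC, this quantity is sensitive to the class ratio, which is what makes it the more informative criterion on heavily imbalanced data.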
TABLE IV. DATA SETS INFORMATION.

Data set        Instances  Features  + Instances  - Instances
Ipums 7019 60 57 6962
Arrhythmia 452 263 13 439
BrazilTourism 412 9 16 396
Primary 339 18 14 325
Sylva.agnostic 14395 213 885 13510
Balance 625 5 49 576
Backache 180 33 25 155
TABLE V. THE AUC-PR RESULTS OBTAINED BY 10-TIMES 10-FOLD CROSS VALIDATION.
Data Set [11] [16] CCW CCPDT Proposed
Ipums 0.136 0.170 0.140 0.037 0.183
Arrhythmia 0.083 0.134 0.229 0.346 0.332
BrazilTourism 0.233 0.184 0.241 0.152 0.213
Primary 0.310 0.347 0.279 0.170 0.286
Sylva.agnostic 0.928 0.922 0.925 0.934 0.930
Balance 0.135 0.091 0.149 0.092 0.140
Backache 0.317 0.330 0.328 0.227 0.340
Average Rank 3.28 3.28 2.71 3.71 2.00
The results show that, in general, algorithm-oriented methods yield better performance than
data-oriented methods. On average, the best results are achieved by the proposed method, and the
Friedman test confirms this assertion. The main drawback of the proposed method is its
sensitivity to the learning rate (η) and the b-parameter, which are tuned empirically and depend
on the nature of the data set.
V. CONCLUSION
This paper proposed a dynamic feature weighting scheme, along with a novel and flexible
weighting function, that can successfully deal with the challenges posed by learning from
imbalanced data, mainly the disproportion of instances per class in the training data set. The
proposed weighting scheme is optimized for the NN algorithm, where each feature has its own
weight function. The weight-function parameters are learned using an iterative learning
algorithm that tries to maximize a differentiable objective function; the objective function is
an approximation of the G-mean criterion, which is common in imbalanced data classification
problems.
2
http://www.agnostic.inf.ethz.ch
A number of experiments involving a large collection of standard benchmark imbalanced data
sets showed the good performance of the proposed method.
In the future, we plan to extend the idea of dynamic weighting functions to other weighted
classification problems.
REFERENCES
[1] Q. Yang and X. Wu, "10 challenging problems in data mining research," Int. J. Inf. Technol. Decis. Mak., vol. 5, no. 4, pp. 597-604, 2006.
[2] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker, and G. D. Tourassi, "Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance," Neural Networks, vol. 21, no. 2, pp. 427-436, 2008.
[3] Y.-M. Huang, C.-M. Hung, and H. C. Jiau, "Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem," Nonlinear Anal. Real World Appl., vol. 7, no. 4, pp. 720-747, 2006.
[4] Y.-H. Liu and Y.-T. Chen, "Face recognition using total margin-based adaptive fuzzy support vector machines," IEEE Trans. Neural Networks, vol. 18, no. 1, pp. 178-192, 2007.
[5] W. Liu and S. Chawla, "Class confidence weighted kNN algorithms for imbalanced data sets," in Advances in Knowledge Discovery and Data Mining, Springer, 2011, pp. 345-356.
[6] P. Branco, L. Torgo, and R. Ribeiro, "A survey of predictive modelling under imbalanced distributions," arXiv preprint arXiv:1505.01658, 2015.
[7] X. Tong, P. Ozturk, and M. Gu, "Dynamic feature weighting in nearest neighbor classifiers," in Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, 2004, vol. 4, pp. 2406-2411.
[8] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, and S. Y. Philip, "Top 10 algorithms in data mining," Knowl. Inf. Syst., vol. 14, no. 1, pp. 1-37, 2008.
[9] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21-27, 1967.
[10] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321-357, 2002.
[12] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, "Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem," in Advances in Knowledge Discovery and Data Mining, Springer, 2009, pp. 475-482.
[13] T. Maciejewski and J. Stefanowski, "Local neighbourhood extension of SMOTE for mining imbalanced data," in 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), 2011, pp. 104-111.
[14] G. Menardi and N. Torelli, "Training and assessing classification rules with imbalanced data," Data Min. Knowl. Discov., vol. 28, no. 1, pp. 92-122, 2014.
[15] M. Z. Jahromi, E. Parvinnia, and R. John, "A method of learning weighted similarity function to improve the performance of nearest neighbor," Inf. Sci., vol. 179, no. 17, pp. 2964-2973, 2009.
[16] T. Yang, L. Cao, and C. Zhang, "A novel prototype reduction method for the K-nearest neighbor algorithm with K >= 1," in Advances in Knowledge Discovery and Data Mining, Springer, 2010, pp. 89-100.
[17] W. Liu, S. Chawla, D. A. Cieslak, and N. V. Chawla, "A robust decision tree algorithm for imbalanced data sets," in SDM, 2010, vol. 10, pp. 766-777.
[18] C. Liu, L. Cao, and P. S. Yu, "A hybrid coupled k-nearest neighbor algorithm on imbalance data," in 2014 International Joint Conference on Neural Networks (IJCNN), 2014, pp. 2011-2018.
[19] L. Peng, H. Zhang, B. Yang, and Y. Chen, "A new approach for imbalanced data classification based on data gravitation," Inf. Sci., vol. 288, pp. 347-373, 2014.
[20] D. Tomar and S. Agarwal, "An effective weighted multi-class least squares twin support vector machine for imbalanced data classification," Int. J. Comput. Intell. Syst., vol. 8, no. 4, pp. 761-778, 2015.
[21] R. Paredes and E. Vidal, "Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization," Pattern Recognit., vol. 39, no. 2, pp. 180-188, 2006.
[22] Z. Hajizadeh, M. Taheri, and M. Z. Jahromi, "Nearest neighbor classification with locally weighted distance for imbalanced data."
[23] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263-1284, 2009.
[24] G. E. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explor. Newsl., vol. 6, no. 1, pp. 20-29, 2004.
[25] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning for control," in Lazy Learning, Springer, 1997, pp. 75-113.
[26] M. Friedman, "The use of ranks to avoid the assumption of normality implicit in the analysis of variance," J. Am. Stat. Assoc., vol. 32, no. 200, pp. 675-701, 1937.
[27] J. Davis and M. Goadrich, "The relationship between Precision-Recall and ROC curves," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 233-240.