"An Imbalanced Data Classification Method based on Automatic Clustering Under-sampling", by Xiaoheng Deng, Weijian Zhong, and Ju Ren, starts from the observation that as data grows day by day it becomes noisier and more complicated, and that this introduces new challenges. One such challenge is imbalanced data: the number of negative samples is much greater than the number of positive samples, so applying a standard classifier learning algorithm and evaluation criterion causes the positive samples to be ignored or treated as noise. Imbalanced data presents three main difficulties:
1. An imbalanced number of samples per class.
2. Class overlapping, which makes normal data easily submerged.
3. Small disjuncts, which complicate the data distribution.

To deal with these difficulties of the standard classifier and evaluation criterion, the authors propose an improved algorithm based on Automatic Clustering and Under-Sampling (ACUS). It works as follows:
1. Select samples from different clusters.
2. Use the variance of a cluster to decide whether it should be divided further.
3. Determine the importance of each cluster by its weight, so that important samples can be found.

ACUS is built on the framework of the AdaBoost algorithm, in which sample weights are modified and a classifier is trained in each round. Over the iterations, clusters of negative samples are divided into sub-clusters until the number of clusters is no longer less than the maximum number of clusters. Before each new classifier is trained, samples are extracted from the clusters according to the weights of the samples in each cluster. Compared with traditional methods, this approach detects representative samples better without computing complex distances.

ACUS consists of the following three steps:
1. Clustering the samples of the majority class.
2. Sampling from the clusters.
3. Training an ensemble classifier.

In the clustering procedure, the time complexity of ACUS is no worse than K-Means and better than hierarchical clustering, and ACUS yields significantly better results. The complexities of the three ACUS steps are O(Nn log(Nn) t), O(Np log Nn), and O(tB), respectively; K-Means is O(Ntd) and hierarchical clustering is O(N^2 d log N).

Advantages:
1. ACUS can distinguish negative samples that lie close to positive samples better than K-Means and hierarchical clustering.
2. ACUS is good at screening out unimportant negative samples as noise by dividing them into several separate clusters.
3. ACUS is more efficient and effective at selecting useful samples from both the positive and negative classes.

Disadvantages:
1. When the degree of sample overlap is high, ACUS ignores most of the overlapped negative samples.
2. Under-sampling introduces instability into ACUS, so the kappa coefficient is computed by running every algorithm five times on the training set, giving ten experimental results.
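The three-step procedure above (cluster the majority class, sample from clusters in proportion to their importance weights, then feed the reduced set to an ensemble) can be sketched in Python. This is a minimal illustration, not the authors' exact algorithm: plain k-means stands in for the automatic clustering step, the variance-based cluster splitting is omitted, and the function names and parameters (`cluster_undersample`, `n_keep`) are my own.

```python
import numpy as np

def cluster_undersample(X_neg, sample_weights, n_clusters=4, n_keep=20, seed=0):
    """Illustrative stand-in for ACUS-style under-sampling: partition the
    majority (negative) class into clusters, score each cluster by the total
    boosting weight of its members, and draw more samples from heavier
    (more important) clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: cluster the majority class (plain k-means as a stand-in for
    # the paper's automatic clustering).
    centers = X_neg[rng.choice(len(X_neg), n_clusters, replace=False)].astype(float)
    labels = np.zeros(len(X_neg), dtype=int)
    for _ in range(10):
        dists = np.linalg.norm(X_neg[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = X_neg[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    # Step 2: give each cluster a quota proportional to its total sample
    # weight, then draw within each cluster according to per-sample weights.
    cluster_w = np.array([sample_weights[labels == k].sum() for k in range(n_clusters)])
    quota = np.maximum(1, np.round(n_keep * cluster_w / cluster_w.sum()).astype(int))
    picked = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        p = sample_weights[idx] / sample_weights[idx].sum()
        take = min(quota[k], len(idx))
        picked.extend(rng.choice(idx, size=take, replace=False, p=p).tolist())
    return np.array(sorted(picked))

# Step 3 (not shown): the selected negatives plus all positives would be
# used to train one classifier per boosting round, AdaBoost-style.
```

In each boosting round the sample weights change, so the quota shifts toward clusters whose members the current ensemble misclassifies, which is the mechanism the summary describes for finding important samples.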
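The second disadvantage mentions the kappa coefficient, which measures agreement between predictions and labels corrected for chance agreement; it is a common choice for imbalanced data because plain accuracy is dominated by the majority class. A minimal NumPy version of Cohen's kappa (equivalent in intent to scikit-learn's `cohen_kappa_score`, which the paper may or may not have used) looks like this:

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance from the marginal
    label frequencies."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)                      # observed agreement
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)  # chance agreement
              for c in classes)
    return (p_o - p_e) / (1.0 - p_e)
```

Because under-sampling is random, a single kappa value is unstable; averaging the kappa of several independent runs (five in the paper's protocol) gives a steadier estimate.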