In many machine learning classification problems, class imbalance is a major issue that results in the algorithm favoring the majority class. This document describes some effective techniques to mitigate this issue.
The given training set comprised 700 class A samples and 2800 class B samples. With this 1:4 class imbalance, many standard algorithms tend to classify some of the class A points as class B. To account for the class imbalance, various approaches were implemented at both the dataset level and the algorithmic level. Oversampling and undersampling are the standard dataset-level approaches, whereas at the algorithmic level there exist modifications to various algorithms that account for class imbalance. Minimal code sketches for each of the techniques below are given at the end of this section.

Dataset level approaches:

Oversampling the minority class: Minority class data points were oversampled using the SMOTE (Synthetic Minority Oversampling Technique) algorithm. In SMOTE, oversampling is implemented by taking each minority class sample and introducing synthetic examples along the line segments joining it to any or all of its k nearest minority class neighbours. A modification of this approach was implemented whereby, instead of points along line segments, centroids of the 3 closest minority class points were introduced. This method showed some success, but did not improve performance by a significant amount.

Undersampling the majority class: Another standard technique to account for class imbalance is undersampling the majority class. The different approaches tried are listed below:

1. Bagging with split datasets: Four training sets were created, and all 700 minority class samples were included in each of them. The majority class was split into four 700-sample subsets, one per training set, resulting in four datasets of 1400 samples each. A classifier was trained on each set individually, and the results from the four classifiers were bagged together to produce the final output. The main problem with this approach was a very high false positive rate (taking the minority class as the positive class).

2. Removing noisy samples using linear regression: A linear regression model was trained, and the distance of each data point from the decision boundary was stored. All majority class points close to the boundary, or on the wrong side of it, were removed. This method helped increase performance.

3. Removing noisy samples using K-NN: In this approach, majority class samples close to a significant fraction of minority class samples were removed. For each majority class data point, the K nearest data points were found; if more than M of these K points belonged to the minority class, the majority class point was removed. The parameters K and M were estimated using cross-validation; in our setting, K = 20 and M = 7 performed best. This method was quite successful in improving performance.

Among these undersampling techniques, removing points using K-NN performed the best.

Algorithmic level approaches: The balanced class weight option available in various scikit-learn classifiers was selected. This balances the class weights before the trees are learnt.
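The following is a minimal sketch of the modified SMOTE variant described above, assuming NumPy feature arrays; the function name, the seed parameter, and the exclusion of the sampled point from its own neighbourhood are our assumptions, not details from the original implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def centroid_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples as centroids of a sampled
    point's k nearest minority-class neighbours (modified-SMOTE variant)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 skips the point itself
    synthetic = []
    for i in rng.integers(len(X_min), size=n_new):
        _, idx = nn.kneighbors(X_min[i:i + 1])
        synthetic.append(X_min[idx[0, 1:]].mean(axis=0))  # centroid of k neighbours
    return np.vstack([X_min, np.asarray(synthetic)])
```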
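A sketch of the split-dataset bagging scheme; the base classifier (a decision tree here) and the simple majority-vote combination rule are assumptions, since the original does not name them:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def split_bagging(X_maj, X_min, base=None, n_splits=4, seed=0):
    """Train one classifier per majority-class split, each paired with the
    full minority class; combine by majority vote (see bagged_predict)."""
    base = base or DecisionTreeClassifier(random_state=seed)
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(len(X_maj)), n_splits)
    models = []
    for part in splits:
        X = np.vstack([X_maj[part], X_min])
        y = np.r_[np.zeros(len(part)), np.ones(len(X_min))]  # 1 = minority class
        models.append(clone(base).fit(X, y))
    return models

def bagged_predict(models, X):
    votes = np.array([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # simple majority vote
```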
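A sketch of the linear-regression noise filter, assuming -1/+1 targets so that the signed prediction acts as a (scaled) distance from the fitted boundary; the margin threshold is an assumed parameter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def drop_boundary_majority(X_maj, X_min, margin=0.1):
    """Fit linear regression with targets -1 (majority) / +1 (minority) and
    remove majority points that score close to the boundary (|score| < margin)
    or land on the minority side of it (score > 0)."""
    X = np.vstack([X_maj, X_min])
    y = np.r_[-np.ones(len(X_maj)), np.ones(len(X_min))]
    scores = LinearRegression().fit(X, y).predict(X_maj)
    return X_maj[scores < -margin]  # keep only confidently-majority points
```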
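A sketch of the K-NN noise removal with the cross-validated values K = 20 and M = 7, assuming neighbours are computed over the combined dataset:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_undersample(X_maj, X_min, k=20, m=7):
    """Remove every majority point with more than m minority points among its
    k nearest neighbours (k=20, m=7 were the cross-validated values above)."""
    X = np.vstack([X_maj, X_min])
    is_minority = np.r_[np.zeros(len(X_maj), bool), np.ones(len(X_min), bool)]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 skips the query point
    _, idx = nn.kneighbors(X_maj)
    counts = is_minority[idx[:, 1:]].sum(axis=1)  # minority neighbours per point
    return X_maj[counts <= m]
```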
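In current scikit-learn, the balanced class weight option is the class_weight="balanced" constructor argument, which reweights each class inversely to its frequency so minority-class errors cost more during training; for example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# class_weight="balanced" reweights classes inversely to their frequencies.
forest = RandomForestClassifier(class_weight="balanced", random_state=0)
svm = SVC(class_weight="balanced")
```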
Final approach used:
Our final model combined the undersampling and oversampling approaches. The number of minority class samples was increased from 700 to 1000 using the modified SMOTE algorithm, and 500 noisy majority class points were removed using the K-NN noise removal. The resulting dataset gave a good increase in performance with many of the standard classifiers (see the sketch below).
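A hedged sketch of this final pipeline, reusing the hypothetical helper functions sketched earlier and assuming X_maj and X_min already hold the majority and minority feature arrays; note that the K-NN filter removes however many points exceed the M cutoff (about 500 in our data), not a fixed count:

```python
import numpy as np

# Oversample the minority class 700 -> 1000 with the modified SMOTE variant.
X_min_aug = centroid_oversample(X_min, n_new=1000 - len(X_min))
# Drop noisy majority points via the K-NN filter (removed ~500 points here).
X_maj_clean = knn_undersample(X_maj, X_min, k=20, m=7)

X_train = np.vstack([X_maj_clean, X_min_aug])
y_train = np.r_[np.zeros(len(X_maj_clean)), np.ones(len(X_min_aug))]
```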