
Machine Learning Techniques:

Handling class imbalance:


The given training set comprised 700 class A samples and 2800 class B samples. With this 1:4 class imbalance, many standard algorithms tend to misclassify minority class A points as class B. To account for the imbalance, various approaches were implemented at both the dataset level and the algorithmic level. Oversampling and undersampling are the standard dataset-level approaches, whereas at the algorithmic level, modifications to various algorithms exist to account for class imbalance.
Dataset level approaches:
Oversampling the minority class: Minority class data points were oversampled using the SMOTE (Synthetic Minority Oversampling Technique) algorithm. In SMOTE, oversampling is implemented by taking each minority class sample and introducing synthetic examples along the line segments joining it to any or all of its k nearest minority class neighbours. A modification to this approach was implemented whereby, instead of points along line segments, centroids of the 3 closest minority class points were introduced. This method showed some success, but did not improve performance by a significant amount.
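The centroid variant described above can be sketched as follows. This is a minimal NumPy illustration, not the original code; the function name `centroid_oversample` and the choice to include the base point in the centroid are assumptions.

```python
import numpy as np

def centroid_oversample(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples, each the centroid of a
    randomly chosen minority point and its k nearest minority neighbours
    (the centroid variant of SMOTE described in the text)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # indices of the k nearest minority neighbours, excluding the point itself
        nn = np.argsort(d[i])[1:k + 1]
        pts = np.vstack([X_min[i], X_min[nn]])
        synthetic.append(pts.mean(axis=0))
    return np.array(synthetic)
```

Because each synthetic point is a centroid of existing minority points, it always lies inside the convex hull of the minority class, whereas plain SMOTE points lie on line segments between neighbour pairs.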
Undersampling the majority class: Another standard technique for handling class imbalance is undersampling the majority class. The different approaches tried are listed below:
1. Bagging with split datasets: Four training sets were made, each containing all 700 minority
   class samples. The majority class was split into four 700-sample subsets, one per training
   set, giving four datasets of 1400 samples each. A classifier was trained on each set
   individually, and the outputs of the four classifiers were bagged together to produce the
   final prediction. The main problem with this approach was a very high false positive rate
   (taking the minority class as the positive class).
2. Removing noisy samples using linear regression: A linear regression model was trained and
   the distances of data points from the decision boundary were stored. All majority class
   points close to the boundary or on the wrong side of it were removed. This method helped
   improve performance.
3. Removing noisy samples using K-NN: In this approach, majority class samples close to a
   significant fraction of minority class samples were removed. For each majority class data
   point, the K nearest data points were found, and if more than M of them belonged to the
   minority class, the majority class point was removed. The parameters K and M were estimated
   using cross-validation; in our setting, K = 20 and M = 7 performed best. This method was
   quite successful in improving performance.
Among these undersampling techniques, removing points using K-NN performed the best.
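The K-NN cleaning rule from approach 3 can be sketched as below. This is an illustrative NumPy implementation under the stated rule (drop a majority point if more than M of its K nearest neighbours, taken over both classes, are minority points); the function name and the exact neighbour pool are assumptions.

```python
import numpy as np

def knn_noise_filter(X_maj, X_min, K=20, M=7):
    """Return the majority samples that survive the K-NN noise filter:
    a majority point is dropped when more than M of its K nearest
    neighbours (searched over both classes) belong to the minority
    class. K=20, M=7 were the cross-validated values in the text."""
    X_all = np.vstack([X_maj, X_min])
    # labels: 0 = majority, 1 = minority
    y_all = np.array([0] * len(X_maj) + [1] * len(X_min))
    keep = []
    for i, x in enumerate(X_maj):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                    # exclude the point itself
        nn = np.argsort(d)[:K]
        if y_all[nn].sum() <= M:         # count of minority neighbours
            keep.append(i)
    return X_maj[keep]
```

Majority points sitting deep inside minority territory are exactly the ones whose neighbourhoods are minority-dominated, so this removes the "noisy" points near or across the class boundary.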
Algorithmic level approaches:
The balance-weights option available in various scikit-learn classifiers was selected. This balances the class weights before the trees are learnt.
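The option referred to is presumably scikit-learn's `class_weight='balanced'` setting (e.g. `RandomForestClassifier(class_weight='balanced')`), which weights each class inversely to its frequency. A small sketch of the weights that rule would assign to this dataset, assuming the documented formula w_c = n_samples / (n_classes * count_c):

```python
import numpy as np

# class A = 0 (700 samples, minority), class B = 1 (2800 samples, majority)
y = np.array([0] * 700 + [1] * 2800)

classes, counts = np.unique(y, return_counts=True)
# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * count_c)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes, weights)))   # minority class gets the larger weight
```

Here the minority class receives weight 2.5 and the majority class 0.625, so each misclassified minority sample costs four times as much during training.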

Final approach used:


A combination of the undersampling and oversampling approaches was used in our final model. The number of minority class samples was increased to 1000 using the modified SMOTE algorithm, and 500 noisy majority class points were removed using K-NN noise filtering. The resulting dataset gave a good increase in performance for many of the standard classifiers.
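The final pipeline can be sketched end-to-end as below: centroid-based oversampling of the minority class up to a target count, followed by K-NN noise filtering of the majority class. Function and parameter names are illustrative, not the original code.

```python
import numpy as np

def rebalance(X_min, X_maj, n_target=1000, k=3, K=20, M=7, seed=0):
    """Sketch of the final pipeline: oversample the minority class to
    n_target points with the centroid variant of SMOTE, then drop
    majority points with more than M minority points among their K
    nearest neighbours."""
    rng = np.random.default_rng(seed)
    # --- centroid oversampling of the minority class ---------------
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    synth = []
    for _ in range(n_target - len(X_min)):
        i = rng.integers(len(X_min))
        nn = np.argsort(d[i])[1:k + 1]       # k nearest minority points
        synth.append(np.vstack([X_min[i], X_min[nn]]).mean(axis=0))
    if synth:
        X_min_up = np.vstack([X_min, np.array(synth)])
    else:
        X_min_up = X_min
    # --- K-NN noise filtering of the majority class ----------------
    X_all = np.vstack([X_maj, X_min_up])
    y_all = np.array([0] * len(X_maj) + [1] * len(X_min_up))
    keep = []
    for i, x in enumerate(X_maj):
        dist = np.linalg.norm(X_all - x, axis=1)
        dist[i] = np.inf                     # exclude the point itself
        if y_all[np.argsort(dist)[:K]].sum() <= M:
            keep.append(i)
    return X_min_up, X_maj[keep]
```

With the counts in the text, this would grow the minority class from 700 to 1000 and shrink the majority class from 2800 to roughly 2300, narrowing the imbalance from 1:4 to about 1:2.3 before any classifier is trained.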