RFCL: A new under-sampling method of reducing the degree of imbalance and overlap

2020 
Imbalanced data are encountered in many areas, such as medical science, the Internet, finance, and surveillance. Learning from imbalanced data, also known as the imbalanced learning problem, remains a major challenge and deserves more attention. In this paper, we focus on class overlap, one of the most important inherent factors that hinder effective learning from imbalanced data. We put forward the overlapping degree (OD) and group data sets into two types: high OD (HOD) and low OD (LOD). Experimental results show that LOD data sets can achieve good results without any under-sampling algorithm, even though some of them have a high degree of imbalance, and under-sampling does not improve their results much. We propose a new under-sampling algorithm, the random forest cleaning rule (RFCL), which removes the majority-class instances that cross a new classification boundary defined by a margin threshold, thereby reducing both the degree of overlap and the degree of imbalance. The threshold is found by maximizing the F1-score of the final classifier. Experimental results show that RFCL outperforms seven classic and two recent under-sampling methods in terms of F1-score and area under the curve, whether the final classifier is a random forest or a support vector machine.
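The abstract describes RFCL only at a high level, so the sketch below is one possible reading rather than the authors' implementation: it assumes a random-forest probability margin toward the minority class, a grid of candidate thresholds, and a held-out split for the F1-score search. The helper names rf_margins and rfcl_undersample are hypothetical.

```python
# A minimal sketch of an RFCL-style cleaning rule, assuming an RF margin and a
# held-out F1 search; not the paper's code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def rf_margins(forest, X, minority_label):
    """Approximate forest margin toward the minority class:
    P(minority) minus P(other classes), averaged over the trees."""
    idx = list(forest.classes_).index(minority_label)
    p_min = forest.predict_proba(X)[:, idx]
    return 2.0 * p_min - 1.0


def rfcl_undersample(X, y, minority_label,
                     thresholds=np.linspace(-1.0, 1.0, 21), random_state=0):
    """Remove majority instances whose margin crosses a threshold (i.e. they lie
    in the overlap region), picking the threshold that maximizes F1 on a
    held-out split. Returns the cleaned training set and the best F1."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)

    forest = RandomForestClassifier(n_estimators=100, random_state=random_state)
    forest.fit(X_tr, y_tr)

    maj_mask = y_tr != minority_label
    margins = rf_margins(forest, X_tr[maj_mask], minority_label)

    best_f1, best_X, best_y = -1.0, X_tr, y_tr
    for t in thresholds:
        # Keep only majority instances clearly on the majority side of the
        # boundary (margin <= t); the rest are treated as overlapping and removed.
        keep_maj = margins <= t
        X_clean = np.vstack([X_tr[~maj_mask], X_tr[maj_mask][keep_maj]])
        y_clean = np.concatenate([y_tr[~maj_mask], y_tr[maj_mask][keep_maj]])

        clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
        clf.fit(X_clean, y_clean)
        f1 = f1_score(y_val, clf.predict(X_val), pos_label=minority_label)
        if f1 > best_f1:
            best_f1, best_X, best_y = f1, X_clean, y_clean

    return best_X, best_y, best_f1
```

Under this reading, the cleaned set returned by rfcl_undersample would then be used to train the final classifier (random forest or SVM in the paper's experiments); a threshold of 1.0 corresponds to no cleaning, so the search degrades gracefully when the data show little overlap.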