TextCut: A Multi-region Replacement Data Augmentation Approach for Text Imbalance Classification

2021 
In practical applications of text classification, data imbalance occurs frequently and typically biases a classifier toward the majority class. Handling imbalanced text datasets so as to alleviate the skewed distribution is therefore a crucial task. Existing mainstream methods tackle it with interpolation-based augmentation strategies that synthesize new texts from minority class texts. However, such interpolation may corrupt the syntactic and semantic information of the original texts, which makes the new texts difficult to model. To overcome this problem, we propose TextCut, a novel data augmentation method based on paired samples. For a minority class text and its paired text, TextCut samples multiple small square regions of the minority text in the hidden space and replaces them with the corresponding regions cut out from the paired text. We build TextCut upon the BERT model to better capture the features of minority class texts. Experiments on three benchmark imbalanced text datasets verify that TextCut can further improve classification performance on both the minority classes and the dataset as a whole, and that it effectively alleviates the imbalance problem.
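The multi-region replacement step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it treats a hidden representation as a plain 2D list (tokens by hidden dimensions) rather than real BERT hidden states, and the function name `textcut_mix` and the parameters `num_regions`, `region_size`, and `seed` are hypothetical. The abstract does not state the region size, the number of regions, how the paired text is selected, or how labels are assigned to the augmented sample, so those choices are assumptions or are left out here.

```python
import random


def textcut_mix(h_minority, h_paired, num_regions=4, region_size=2, seed=None):
    """Replace several square regions of h_minority with the same regions of h_paired.

    Both inputs are 2D lists of equal shape (tokens x hidden dimensions).
    Returns the mixed representation and a boolean mask of replaced cells.
    All parameter names and defaults are illustrative assumptions.
    """
    rows, cols = len(h_minority), len(h_minority[0])
    if len(h_paired) != rows or any(len(r) != cols for r in h_paired):
        raise ValueError("hidden representations must have the same shape")
    if region_size > min(rows, cols):
        raise ValueError("region_size must fit inside the representation")

    rng = random.Random(seed)
    mixed = [list(row) for row in h_minority]  # copy, inputs stay untouched
    mask = [[False] * cols for _ in range(rows)]

    for _ in range(num_regions):
        # Sample the top-left corner of a square region (randint is inclusive).
        top = rng.randint(0, rows - region_size)
        left = rng.randint(0, cols - region_size)
        for i in range(top, top + region_size):
            for j in range(left, left + region_size):
                # Copy the corresponding cell from the paired representation.
                mixed[i][j] = h_paired[i][j]
                mask[i][j] = True
    return mixed, mask


if __name__ == "__main__":
    # Toy stand-ins for hidden states: 6 tokens, 8 hidden dimensions.
    minority = [[0.0] * 8 for _ in range(6)]
    paired = [[1.0] * 8 for _ in range(6)]
    mixed, mask = textcut_mix(minority, paired, num_regions=3, seed=0)
    for row in mixed:
        print(" ".join(str(int(v)) for v in row))
```

Regions may overlap in this sketch, so the number of replaced cells is at most `num_regions * region_size ** 2`. Whether the original method permits overlap is not specified in the abstract.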