A Bilingual Multi-type Spam Detection Model Based on M-BERT

2020 
Spam has harassed Internet users for a long time, and detecting it accurately and efficiently remains a critical problem. Many approaches have been proposed, such as black and white lists, machine learning methods, and content-level deep learning measures. Surveying previous work, we find that most methods reach an accuracy of about 0.95 when they focus on a single type of spam in a single language. Nowadays, however, people receive spam messages of different types, from different sources, and even in different languages. To address this, we develop a novel model based on Google's multilingual bidirectional encoder representations from transformers (M-BERT). We also build a new bilingual multi-type spam dataset to train the model, using optical character recognition (OCR) to extract text from image-based spam. In our experiments, the proposed model reaches an accuracy of 0.9648, outperforming the comparison models. In terms of time overhead, it costs only 0.3168 seconds per training step, which is acceptable. These results demonstrate that our approach can detect bilingual multi-type spam effectively.
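The abstract outlines a two-stage pipeline: OCR converts image-based spam into text, and a fine-tuned M-BERT classifier labels the resulting text as spam or ham. Below is a minimal sketch of that pipeline using the Hugging Face transformers library and pytesseract. The checkpoint name, the choice of OCR engine, the maximum sequence length, and the label convention (1 = spam) are illustrative assumptions, not details given in the abstract; the classifier head would still need to be fine-tuned on the authors' bilingual multi-type dataset before use.

```python
# Sketch of the described pipeline: OCR for image-based spam, then a
# multilingual BERT sequence classifier. Assumptions are noted inline.
import pytesseract
from PIL import Image
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # Google's public M-BERT checkpoint (assumed)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# num_labels=2 for a binary spam/ham head; this head is randomly initialized
# and must be fine-tuned on a labeled spam dataset before real use.
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def message_to_text(message):
    """Return plain text: OCR image-based spam, pass text spam through."""
    if isinstance(message, Image.Image):
        # pytesseract is an assumed OCR engine; the abstract only says OCR is used.
        return pytesseract.image_to_string(message)
    return message

def is_spam(message) -> bool:
    """Classify a text or image message as spam (True) or ham (False)."""
    text = message_to_text(message)
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1)) == 1  # assumed label convention: 1 = spam

# Example usage on a text message and an image-based message:
# print(is_spam("Congratulations! You have won a free prize, click here."))
# print(is_spam(Image.open("suspicious_flyer.png")))
```

Routing both input types into a single text classifier keeps the model itself unimodal, which matches the abstract's description of extracting text from image-based spam rather than training a separate image classifier.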