A Bilingual Multi-type Spam Detection Model Based on M-BERT

2020 
Spam has harassed Internet users for a long time, and detecting it accurately and efficiently remains a critical problem. Many approaches have been proposed, such as black and white lists, machine learning methods, and content-level deep learning measures. Surveying previous work, we find that most methods reach an accuracy of about 0.95 when they focus on a single type of spam in a single language. Nowadays, however, people receive spam messages of different types, from different sources, and even in different languages. To address this, we develop a novel model based on Google's multilingual bidirectional encoder representations from transformers (M-BERT). We also build a new bilingual multi-type spam dataset to train the model, using optical character recognition (OCR) to extract text from image-based spam. In our experiments, the proposed model reaches an accuracy of 0.9648, outperforming the comparison models. In terms of time overhead, it costs only 0.3168 seconds per training step, which is acceptable. These results demonstrate that our approach can detect bilingual multi-type spam effectively.
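The abstract outlines a two-stage pipeline: OCR converts image-based spam into text, and a fine-tuned M-BERT classifier labels the resulting text as spam or ham. Below is a minimal sketch of that pipeline using the Hugging Face transformers library and pytesseract. The checkpoint name, the choice of OCR engine, the maximum sequence length, and the label convention (1 = spam) are illustrative assumptions, not details given in the abstract; the classifier head would still need to be fine-tuned on the authors' bilingual multi-type dataset before use.

```python
# Sketch of the described pipeline: OCR for image-based spam, then a
# multilingual BERT sequence classifier. Assumptions are noted inline.
import pytesseract
from PIL import Image
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # Google's public M-BERT checkpoint (assumed)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# num_labels=2 for a binary spam/ham head; this head is randomly initialized
# and must be fine-tuned on a labeled spam dataset before real use.
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def message_to_text(message):
    """Return plain text: OCR image-based spam, pass text spam through."""
    if isinstance(message, Image.Image):
        # pytesseract is an assumed OCR engine; the abstract only says OCR is used.
        return pytesseract.image_to_string(message)
    return message

def is_spam(message) -> bool:
    """Classify a text or image message as spam (True) or ham (False)."""
    text = message_to_text(message)
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1)) == 1  # assumed label convention: 1 = spam

# Example usage on a text message and an image-based message:
# print(is_spam("Congratulations! You have won a free prize, click here."))
# print(is_spam(Image.open("suspicious_flyer.png")))
```

Routing both input types into a single text classifier keeps the model itself unimodal, which matches the abstract's description of extracting text from image-based spam rather than training a separate image classifier.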