DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction

2021 
Spelling Error Correction (SEC), which detects and corrects spelling errors in text, has a wide range of applications in human language understanding. Earlier solutions, including statistical methods and one-stage and two-stage machine learning methods, cannot build deeply bidirectional models, which significantly limits their learning ability. With the recent emergence of masked language models, transformer-based networks have achieved remarkable success in SEC. However, current transformer-based Chinese SEC algorithms are all end-to-end methods, which suffer from high false alarm rates because they attempt to correct every character of a sentence regardless of whether it is correct. This issue becomes even more severe when only a small fraction of the characters in a sentence are incorrect. To solve this problem, we propose DCSpell, a cloze-style detector-corrector framework that first detects whether a character is erroneous before correcting it. Specifically, DCSpell employs the discriminator of ELECTRA as the Detector to locate the positions of incorrect characters. The Detector is trained with the sample-efficient replaced token detection pre-training task and thus allows domain adaptation with a small amount of data. A transformer-based Corrector then finds the correct character for each detected position. It takes sentence pairs as input, which potentially incorporates knowledge of phonological and visual similarity. Confusion-set-based post-processing further improves performance. Experiments show that, compared to state-of-the-art methods, DCSpell achieves a 15.7% improvement in F1 score on the SIGHAN dataset and a 6.6% improvement on a dataset transcribed from a real-world acoustic speech corpus.
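The detect-then-correct flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `detect_then_correct`, the 0.5 threshold, the `[MASK]` token, and the toy detector, corrector, and confusion set are all hypothetical stand-ins for the ELECTRA discriminator, the transformer Corrector, and the real confusion set, and the filtering rule shown is only one plausible reading of "confusion-set-based post-processing".

```python
from typing import Callable, Dict, List, Set

MASK = "[MASK]"  # assumed mask token, for illustration only


def detect_then_correct(
    sentence: str,
    detector: Callable[[str], List[float]],
    corrector: Callable[[str, List[str], int], List[str]],
    confusion_set: Dict[str, Set[str]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> str:
    chars = list(sentence)

    # Step 1 (Detector): flag only the positions judged likely to be wrong.
    scores = detector(sentence)
    flagged = [i for i, p in enumerate(scores) if p >= threshold]

    # Step 2 (Corrector input): pair the original sentence with a copy in
    # which the flagged positions are masked, cloze-style.
    masked = [MASK if i in flagged else c for i, c in enumerate(chars)]

    out = chars[:]
    for i in flagged:
        candidates = corrector(sentence, masked, i)  # ranked best-first

        # Step 3 (post-processing, assumed rule): accept the best candidate
        # that the confusion set lists as similar to the original character;
        # otherwise leave the original character in place.
        allowed = confusion_set.get(chars[i], set())
        for cand in candidates:
            if cand in allowed:
                out[i] = cand
                break

    # Unflagged characters are never touched, which is what keeps the
    # false alarm rate down compared to correcting every position.
    return "".join(out)


# Toy stand-ins so the sketch runs without any trained model.
def toy_detector(sentence: str) -> List[float]:
    return [0.9 if c == "平" else 0.1 for c in sentence]


def toy_corrector(sentence: str, masked: List[str], position: int) -> List[str]:
    return ["水", "苹"]


TOY_CONFUSION = {"平": {"苹", "评"}}

result = detect_then_correct("我喜欢吃平果", toy_detector, toy_corrector, TOY_CONFUSION)
print(result)
```

In this toy run only the character 平 is flagged; the top-ranked candidate 水 is rejected because it is not in the confusion set for 平, and the second candidate 苹 is accepted.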