A Study of BPE-based Language Modeling for Open Vocabulary Latin Language OCR

2020 
We present a study of byte pair encoding (BPE) based language modeling for open-vocabulary Latin language OCR. On a large-scale handwritten English OCR task, we demonstrate that a simple BPE-based n-gram language model (LM) handles the out-of-vocabulary word problem effectively and achieves a better accuracy-footprint tradeoff than a state-of-the-art hybrid word/subword n-gram LM obtained by interpolating a standard hybrid LM, a word-based LM, and a subword-based LM. On another large-scale printed OCR task covering six Latin languages, namely English, Spanish, French, German, Italian, and Portuguese, we find that a unified OCR system with a single character-based optical model and a single BPE-based n-gram LM shared by all six languages performs better than language-dependent OCR systems. A BPE-based LM thus offers a good product solution for both monolingual and multilingual open-vocabulary Latin language OCR.
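To illustrate why BPE units sidestep the out-of-vocabulary problem, the following is a minimal sketch of generic BPE merge learning and segmentation. It is not the implementation used in this study: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of training words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment any word, seen or unseen, by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; "lowest" never appears in it.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))
```

Because segmentation falls back to single characters whenever no merge applies, any word over the character inventory can be represented, so an n-gram LM trained over such units never meets a truly unknown word.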