A Study of BPE-based Language Modeling for Open Vocabulary Latin Language OCR

Wenping Hu, Yikang Luo, Ji Meng, Zifei Qian, Qiang Huo

2020

We present a study of byte pair encoding (BPE) based language modeling for open-vocabulary Latin language OCR. On a large-scale handwritten English OCR task, we demonstrate that a simple BPE-based n-gram language model (LM) handles the out-of-vocabulary word problem effectively and achieves a better accuracy-footprint tradeoff than a state-of-the-art hybrid word/subword n-gram LM that interpolates a standard hybrid LM, a word-based LM, and a subword-based LM. On another large-scale printed OCR task covering six Latin languages (English, Spanish, French, German, Italian, and Portuguese), we find that a unified OCR system with a single character-based optical model and a single BPE-based n-gram LM shared by all six languages performs better than language-dependent OCR systems. A BPE-based LM therefore offers a good product solution for both monolingual and multilingual open-vocabulary Latin language OCR.
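
To illustrate why a BPE vocabulary is open, the following minimal Python sketch learns BPE merges from a toy word-frequency table and then segments a word that never appeared in it. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the toy corpus are all assumptions made for illustration. An n-gram LM would then be trained over the resulting subword units rather than over whole words.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} table (toy version)."""
    # Start from characters plus an end-of-word marker (an assumed convention).
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Merge the most frequent pair everywhere in the vocabulary.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Segment any word, seen or unseen, by replaying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; "lowest" is not in it but can still be segmented.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=10)
unseen_pieces = segment("lowest", toy_merges)
```

Because segmentation can always fall back to single characters, every word made of characters seen in training decomposes into known units, so the LM built on those units has no out-of-vocabulary words in that sense.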
