A Study of BPE-based Language Modeling for Open Vocabulary Latin Language OCR
2020
We present a study of byte pair encoding (BPE) based language modeling for open-vocabulary Latin language OCR. On a large-scale handwritten English OCR task, we demonstrate that a simple BPE-based n-gram language model (LM) can deal with the out-of-vocabulary (OOV) word problem effectively and achieve a better accuracy-footprint tradeoff than a state-of-the-art hybrid word/subword n-gram LM obtained by interpolating a standard hybrid LM, a word-based LM, and a subword-based LM. On another large-scale printed OCR task for six Latin languages, namely English, Spanish, French, German, Italian, and Portuguese, we find that a unified OCR system with a single character-based optical model and a single BPE-based n-gram LM shared by all six languages performs better than language-dependent OCR systems. A BPE-based LM thus offers a good product solution for both monolingual and multilingual open-vocabulary Latin language OCR.
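To illustrate why BPE units sidestep the out-of-vocabulary problem, here is a minimal sketch of the generic BPE procedure: learn merge rules from word counts, then segment an unseen word with those rules. This is not the authors' implementation. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the "</w>" marker is an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically
        # so the result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word, seen or unseen, by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical counts, for illustration only).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=10)

# "lowest" never appears in the toy corpus, yet it still decomposes into
# units built from characters, so an LM over BPE units can score it.
print(segment("lowest", toy_merges))
```

Because segmentation falls back to single characters when no merge applies, any word written with characters seen in training gets a valid unit sequence. An n-gram LM over such units therefore needs no separate OOV mechanism.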