Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining

2021 
This paper describes UTokyo's submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Nahuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 points higher and chrF scores that were 0.0749 points higher than the baseline.
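For context, BLEU and chrF are the two metrics cited above. The sketch below shows one common way to compute both with the sacrebleu library; the file names are hypothetical and this is not the shared task's official scorer, whose chrF may be reported on a 0-1 rather than 0-100 scale.

```python
# Minimal sketch (assumption, not the paper's evaluation script): score a
# system output against references with sacrebleu.
import sacrebleu

# Hypothetical files: one detokenized sentence per line, aligned by line number.
with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu / corpus_chrf take the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}")   # corpus-level BLEU, 0-100
print(f"chrF: {chrf.score:.2f}")   # character n-gram F-score, 0-100 in sacrebleu
```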