Leveraging Arabic-English bilingual corpora with crowd sourcing-based annotation for Arabic-Hebrew SMT

Manish Gaurav, Guruprasad Saikumar, Amit Srivastava, Premkumar Natarajan, Shankar Ananthakrishnan, Spyros Matsoukas

2013

Recent work in Statistical Machine Translation (SMT) has focused on translating foreign languages into English. As SMT systems have matured, however, demand has grown for translating directly from one foreign language into another. Unfortunately, parallel training corpora for a pair of morphologically complex languages such as Arabic and Hebrew are very scarce. This paper uses active-learning-based data selection and crowd sourcing through a platform such as Amazon Mechanical Turk to create an Arabic-Hebrew parallel corpus. It then explores two techniques for building an Arabic-Hebrew SMT system. The first is the traditional cascade of two SMT systems with English as the pivot language. The second trains a direct Arabic-Hebrew SMT system using sentence pivoting. Finally, a phrase generalization approach is used to further improve performance.
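
The two approaches named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word-by-word lexicons stand in for trained SMT systems, the transliterated tokens are made-up placeholders, and the function names (`make_word_translator`, `cascade`, `sentence_pivot_corpus`) are hypothetical. It also assumes one common reading of sentence pivoting, in which the English side of an Arabic-English corpus is translated into Hebrew to produce Arabic-Hebrew training pairs; the abstract does not spell out the procedure.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def make_word_translator(lexicon: Dict[str, str]) -> Translator:
    """Toy word-by-word translator; a stand-in for a trained SMT system."""
    def translate(sentence: str) -> str:
        # Unknown tokens are passed through unchanged.
        return " ".join(lexicon.get(tok, tok) for tok in sentence.split())
    return translate


def cascade(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Cascade approach: translate source -> pivot, then pivot -> target."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate


def sentence_pivot_corpus(
    src_pivot_pairs: List[Tuple[str, str]],
    pivot_to_tgt: Translator,
) -> List[Tuple[str, str]]:
    """Sentence pivoting (assumed reading): translate the pivot side of a
    source-pivot corpus into the target language, yielding source-target
    pairs on which a direct system could then be trained."""
    return [(src, pivot_to_tgt(pivot)) for src, pivot in src_pivot_pairs]


# Hypothetical transliterated toy lexicons, for illustration only.
ar_to_en = make_word_translator({"kitab": "book", "jadid": "new"})
en_to_he = make_word_translator({"book": "sefer", "new": "hadash"})

# Cascade: Arabic -> English -> Hebrew at translation time.
ar_to_he = cascade(ar_to_en, en_to_he)
cascaded = ar_to_he("kitab jadid")

# Sentence pivoting: build Arabic-Hebrew training pairs ahead of time.
pivoted = sentence_pivot_corpus([("kitab jadid", "new book")], en_to_he)
```

The practical difference is where the pivot step happens: the cascade runs two systems for every input sentence, so errors from the first system feed into the second, whereas sentence pivoting pays the pivot cost once when building the corpus and then needs only a single direct system at translation time.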
