Automatic Parallel Fragment Extraction from Noisy Data

Jason Riesa, Daniel Marcu

2012

We present a novel method for detecting parallel fragments within noisy parallel corpora. Isolating these fragments from the surrounding noisy data frees us from noisy alignments and stray links that can severely constrain translation-rule extraction. The method requires no new machinery: it reuses an existing word alignment model. We evaluate the quality and utility of the extracted data on large-scale Chinese-English and Arabic-English translation tasks and show significant improvements over a state-of-the-art baseline.

Keywords:

Machine learning
Noisy data
Computer science
Artificial intelligence
Pattern recognition
Parallel corpora