Chinese Multi-word Chunks Extraction for Computer Aided Translation

2007 
This paper proposes a methodology for extracting multi-word chunks for translation purposes. Our basic idea is to use a hybrid method that combines statistical measures with linguistic rules. The extraction system operates in four steps: (1) tokenization of the Chinese corpus; (2) extraction of multi-word chunks (2-grams to 10-grams) using Nagao's algorithm and the Substring Reduction algorithm; (3) statistical filtering, which combines Mutual Information (or Log-likelihood Ratio) with Left/Right Entropy; (4) linguistic filtering by chunk formation rules and a stop-word list. The hybrid method proved suitable for selecting multi-word chunks: it considerably improved the precision of the extraction, which is much higher than that of a purely statistical method. We believe that multi-word chunks extracted in this way could be used effectively to supplement existing translation memory databases.
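
To make the statistical filtering of step (3) concrete, the following is a minimal sketch of how Mutual Information and Left/Right Entropy might be computed for a candidate chunk. It is an illustration under stated assumptions, not the system described above: the toy corpus, the function names, the `<s>`/`</s>` boundary markers, and the thresholds in `passes_filter` are all hypothetical, and plain n-gram counting stands in for Nagao's algorithm and the Substring Reduction step.

```python
import math
from collections import Counter

# Hypothetical, already tokenized toy corpus (step 1 is assumed done).
SENTENCES = [
    ["我们", "使用", "翻译", "记忆", "库"],
    ["翻译", "记忆", "库", "很", "有用"],
    ["他", "更新", "翻译", "记忆", "库"],
]

def ngram_counts(sentences, max_n=3):
    """Count every n-gram of length 1..max_n (naive stand-in for step 2)."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def pmi(bigram, counts, total_tokens):
    """Pointwise mutual information (in bits) of a two-token candidate."""
    x, y = bigram
    p_xy = counts[bigram] / total_tokens
    p_x = counts[(x,)] / total_tokens
    p_y = counts[(y,)] / total_tokens
    return math.log2(p_xy / (p_x * p_y))

def entropy(neighbor_counts):
    """Shannon entropy (in bits) of a neighbor frequency distribution."""
    total = sum(neighbor_counts.values())
    return sum(-(c / total) * math.log2(c / total)
               for c in neighbor_counts.values())

def left_right_entropy(candidate, sentences):
    """Entropy of the tokens seen immediately left and right of a candidate."""
    n = len(candidate)
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == candidate:
                # Sentence boundaries are counted as their own context.
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + n] if i + n < len(tokens) else "</s>"] += 1
    return entropy(left), entropy(right)

def passes_filter(bigram, counts, total_tokens, sentences,
                  min_pmi=1.0, min_entropy=0.5):
    """Keep a bigram only if it is cohesive and has varied context.

    Both thresholds are arbitrary illustrative values.
    """
    left_h, right_h = left_right_entropy(bigram, sentences)
    return (pmi(bigram, counts, total_tokens) >= min_pmi
            and min(left_h, right_h) >= min_entropy)

counts = ngram_counts(SENTENCES)
total_tokens = sum(len(s) for s in SENTENCES)
candidate = ("翻译", "记忆")
print(pmi(candidate, counts, total_tokens))
print(left_right_entropy(candidate, SENTENCES))
print(passes_filter(candidate, counts, total_tokens, SENTENCES))
```

In this toy corpus the candidate 翻译 记忆 is always followed by 库, so its right entropy is zero and the sketch rejects it even though its Mutual Information is high. This shows why a cohesion measure alone is not enough: low boundary entropy suggests the candidate is a fragment of a longer chunk (here 翻译 记忆 库).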