A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction

Rui Liu, Feilong Bao, Guanglai Gao, Hui Zhang, Yonghe Wang

2018

In this paper, we are the first to apply word embeddings that focus on sub-word units to the Mongolian Phrase Break (PB) prediction task, using a Long Short-Term Memory (LSTM) model. Mongolian is an agglutinative language: each root can be followed by several suffixes, potentially forming millions of words. Existing Mongolian corpora are too small to train robust whole-word embeddings, so the task suffers from serious data sparsity, which makes Mongolian PB prediction difficult. To address this problem, we look at the sub-word units within each Mongolian word, encode their information into a meaningful representation, and feed it to an LSTM to decode the best corresponding PB label. Experimental results show that the proposed model significantly outperforms a traditional CRF model that uses manually designed features, obtaining a gain of 7.49 in F-measure.
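To illustrate why sub-word units help with data sparsity, here is a minimal sketch of composing a word representation from its sub-word units. It is not the authors' implementation. It assumes character trigrams as the sub-word unit, random vectors in place of trained ones, and averaging as the composition step; the abstract specifies none of these. The names `subword_units` and `SubwordEmbedder` and the placeholder strings `rootsufa` and `rootsufb` are hypothetical. The LSTM tagger that would consume these vectors is omitted.

```python
import random


def subword_units(word, n=3):
    """Return character n-grams of a word padded with boundary markers.

    Character n-grams are an assumed choice of sub-word unit; the paper's
    actual units may differ.
    """
    padded = "<" + word + ">"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class SubwordEmbedder:
    """Toy embedder: one vector per sub-word unit, averaged per word."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}  # sub-word unit -> vector (random here, learned in practice)

    def unit_vector(self, unit):
        # Create a vector lazily the first time a unit is seen.
        if unit not in self.table:
            self.table[unit] = [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self.table[unit]

    def word_vector(self, word):
        # Average the vectors of all sub-word units of the word.
        vecs = [self.unit_vector(u) for u in subword_units(word)]
        return [sum(col) / len(vecs) for col in zip(*vecs)]


embedder = SubwordEmbedder(dim=8)
# Two hypothetical word forms that share a root but differ in the suffix.
shared = set(subword_units("rootsufa")) & set(subword_units("rootsufb"))
vec = embedder.word_vector("rootsufb")
```

Because the two word forms share most of their units, a form that is rare or unseen as a whole word still receives a representation built largely from units observed elsewhere. In a full system the resulting word vectors would be passed, one per token, to a sequence model such as an LSTM that outputs a PB label for each word.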
