Sequence Generation with Mixed Representations

2020 
Tokenization is the first step of many natural language processing (NLP) tasks and plays an important role in neural NLP models. Tokenization methods such as byte-pair encoding (BPE) and SentencePiece, which can greatly reduce the vocabulary size and handle out-of-vocabulary words, have been shown to be effective and are widely adopted for sequence generation tasks. While various tokenization methods exist, there is no consensus on which one is best. In this work, we propose to leverage mixed representations from different tokenizers for sequence generation tasks, which can take advantage of each individual tokenization method. Specifically, we introduce a new model architecture to incorporate mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods. Our approach achieves significant improvements on neural machine translation tasks across six language pairs, as well as on an abstractive summarization task.
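
As a minimal illustration of the diversity the abstract refers to (not the paper's actual architecture or co-teaching algorithm), the following sketch trains a BPE model and a unigram model with the open-source sentencepiece library and shows that they segment the same sentence differently; the corpus path, model prefixes, and vocabulary size are placeholder assumptions.

    # Illustrative sketch: two tokenizers trained on the same corpus can
    # produce different segmentations of the same sentence, which is the
    # kind of diversity a mixed-representation model could exploit.
    import sentencepiece as spm

    # Hypothetical corpus file and vocabulary size; adjust for real data.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="bpe", vocab_size=8000, model_type="bpe")
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="unigram", vocab_size=8000, model_type="unigram")

    bpe = spm.SentencePieceProcessor(model_file="bpe.model")
    uni = spm.SentencePieceProcessor(model_file="unigram.model")

    sentence = "tokenization matters for sequence generation"
    print(bpe.encode(sentence, out_type=str))  # one segmentation into subwords
    print(uni.encode(sentence, out_type=str))  # typically a different segmentation

A sequence generation model could then consume both token streams for the same input, which is the intuition behind combining representations from multiple tokenizers.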