Enhancing Hybrid Self-attention Structure with Relative-position-aware Bias for Speech Synthesis

2019 
Compared with the conventional "front-end"–"back-end"–"vocoder" pipeline, attention-based end-to-end speech synthesis systems are trained as a whole and synthesize directly from the text sequence to the acoustic feature sequence. Recently, a more computationally efficient end-to-end architecture named the Transformer, which is based solely on self-attention, was proposed to model global dependencies between the input and output sequences. However, despite its many advantages, the Transformer lacks position information in its structure. Moreover, the weighted-sum form of self-attention may disperse attention over the whole input sequence rather than focusing on the more important neighbouring positions. To address these problems, this paper introduces a hybrid self-attention structure that combines self-attention with recurrent neural networks (RNNs). We further enhance the proposed structure with relative-position-aware biases. Mean opinion score (MOS) test results indicate that, with relative-position-aware biases added to the hybrid self-attention structure, the proposed system achieves the best performance, with a MOS only 0.11 lower than that of natural recordings.
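The abstract does not include an implementation, so the following is only a minimal PyTorch-style sketch of the kind of hybrid block it describes: an RNN sub-layer combined with self-attention whose logits are augmented by a learned relative-position-aware bias. All module names, dimensions, and the clipping distance are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosBiasSelfAttention(nn.Module):
    """Single-head self-attention with a learned bias per clipped relative distance."""
    def __init__(self, d_model, max_rel_dist=16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.max_rel_dist = max_rel_dist
        # One learned scalar bias for each relative distance in
        # [-max_rel_dist, max_rel_dist]; nearby positions can thus be favoured.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel_dist + 1))

    def forward(self, x):                       # x: (batch, time, d_model)
        T = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(1, 2)) * self.scale   # (B, T, T)
        # Relative distance j - i, clipped to the learnable range.
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        logits = logits + self.rel_bias[rel + self.max_rel_dist]   # broadcast (T, T)
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)

class HybridBlock(nn.Module):
    """RNN (BLSTM) sub-layer for local/positional context, followed by
    relative-position-biased self-attention for global dependencies."""
    def __init__(self, d_model):
        super().__init__()
        self.rnn = nn.LSTM(d_model, d_model // 2, batch_first=True,
                           bidirectional=True)
        self.attn = RelPosBiasSelfAttention(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h, _ = self.rnn(x)
        x = self.norm1(x + h)
        return self.norm2(x + self.attn(x))

# Hypothetical usage: 100 encoder steps with 256-dimensional hidden states.
x = torch.randn(2, 100, 256)
y = HybridBlock(256)(x)
print(y.shape)   # torch.Size([2, 100, 256])
```

The intent of the bias term is to counter the dispersion problem mentioned in the abstract: because the bias depends only on the (clipped) relative distance between query and key positions, the model can learn to raise the attention logits of neighbouring positions without discarding the global weighted sum.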