Enhancing Hybrid Self-attention Structure with Relative-position-aware Bias for Speech Synthesis

2019 
Compared with the conventional "front-end"–"back-end"–"vocoder" pipeline, attention-based end-to-end speech synthesis systems are trained as a whole and synthesize directly from the text sequence to the acoustic feature sequence. Recently, a more computationally efficient end-to-end architecture named the Transformer, which is based solely on self-attention, was proposed to model global dependencies between the input and output sequences. However, despite its many advantages, the Transformer lacks position information in its structure. Moreover, the weighted-sum form of self-attention may disperse attention over the whole input sequence rather than focusing on the more important neighbouring positions. To address these problems, this paper introduces a hybrid self-attention structure that combines self-attention with recurrent neural networks (RNNs). We further enhance the proposed structure with relative-position-aware biases. Mean opinion score (MOS) test results indicate that, with relative-position-aware biases added to the hybrid self-attention structure, the proposed system achieves the best performance, with a MOS only 0.11 lower than that of natural recordings.
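The abstract does not include an implementation, so the following is only a minimal PyTorch-style sketch of the kind of hybrid block it describes: an RNN sub-layer combined with self-attention whose logits are augmented by a learned relative-position-aware bias. All module names, dimensions, and the clipping distance are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosBiasSelfAttention(nn.Module):
    """Single-head self-attention with a learned bias per clipped relative distance."""
    def __init__(self, d_model, max_rel_dist=16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.max_rel_dist = max_rel_dist
        # One learned scalar bias for each relative distance in
        # [-max_rel_dist, max_rel_dist]; nearby positions can thus be favoured.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel_dist + 1))

    def forward(self, x):                       # x: (batch, time, d_model)
        T = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.matmul(q, k.transpose(1, 2)) * self.scale   # (B, T, T)
        # Relative distance j - i, clipped to the learnable range.
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        logits = logits + self.rel_bias[rel + self.max_rel_dist]   # broadcast (T, T)
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)

class HybridBlock(nn.Module):
    """RNN (BLSTM) sub-layer for local/positional context, followed by
    relative-position-biased self-attention for global dependencies."""
    def __init__(self, d_model):
        super().__init__()
        self.rnn = nn.LSTM(d_model, d_model // 2, batch_first=True,
                           bidirectional=True)
        self.attn = RelPosBiasSelfAttention(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h, _ = self.rnn(x)
        x = self.norm1(x + h)
        return self.norm2(x + self.attn(x))

# Hypothetical usage: 100 encoder steps with 256-dimensional hidden states.
x = torch.randn(2, 100, 256)
y = HybridBlock(256)(x)
print(y.shape)   # torch.Size([2, 100, 256])
```

The intent of the bias term is to counter the dispersion problem mentioned in the abstract: because the bias depends only on the (clipped) relative distance between query and key positions, the model can learn to raise the attention logits of neighbouring positions without discarding the global weighted sum.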