Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis

2021 
In speech interaction scenarios, speech emphasis is essential for expressing the underlying intention and attitude. Recently, end-to-end emphatic speech synthesis greatly improves the naturalness of synthetic speech, but also brings new problems: 1) lack of interpretability for how emphatic codes affect the model; 2) no separate control of emphasis on duration and on intonation and energy. We propose a novel way to build an interpretable and controllable emphatic speech synthesis framework based on forward attention. Firstly, we explicitly model the local variation of speaking rate for emphasized words and neutral words with modified forward attention to manifest emphasized words in terms of duration. The 2-layers LSTM in decoder is further divided into attention-RNN and decoder-RNN to disentangle the influence of emphasis on duration and on intonation and energy. The emphasis information is injected into decoder-RNN for highlighting emphasized words in the aspects of intonation and energy. Experimental results have shown that our model can not only provide separate control of emphasis on duration and on intonation and energy, but also generate more robust and prominent emphatic speech with high quality and naturalness.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    2
    Citations
    NaN
    KQI
    []