A Comparison of Expressive Speech Synthesis Approaches Based on Neural Networks

2018 
Adaptability and controllability in changing speaking styles and speaker characteristics are key advantages of deep neural network (DNN)-based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study of DNNs for expressive speech synthesis with a small set of emotional speech data. Specifically, we study three typical model adaptation approaches: (1) retraining a neural model on emotion-specific data (retrain), (2) augmenting the network input with emotion-specific codes (code), and (3) using emotion-dependent output layers with shared hidden layers (multi-head). Long short-term memory (LSTM) networks are used as the acoustic models. Objective and subjective evaluations demonstrate that the multi-head approach consistently outperforms the other two, delivering more natural emotion in the synthesized speech.
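To make the second and third adaptation approaches concrete, below is a minimal NumPy sketch of the "code" and "multi-head" ideas. All dimensions and weight values are hypothetical (the paper does not specify them here), and a single feedforward layer stands in for the paper's LSTM acoustic model; this only illustrates the data flow, not the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 300-dim linguistic
# features, 4 emotions, 128-dim shared hidden layer, 60-dim acoustic output.
N_LING, N_EMO, N_HID, N_OUT = 300, 4, 128, 60

# (2) "code" approach: append a one-hot emotion code to the network input.
def augment_with_code(ling_feats, emotion_id):
    code = np.zeros(N_EMO)
    code[emotion_id] = 1.0
    return np.concatenate([ling_feats, code])

# (3) "multi-head" approach: shared hidden weights, plus one
# emotion-dependent output layer per emotion.
W_shared = rng.standard_normal((N_HID, N_LING)) * 0.01
W_heads = [rng.standard_normal((N_OUT, N_HID)) * 0.01 for _ in range(N_EMO)]

def multi_head_forward(ling_feats, emotion_id):
    h = np.tanh(W_shared @ ling_feats)   # shared hidden representation
    return W_heads[emotion_id] @ h       # emotion-specific output layer

x = rng.standard_normal(N_LING)
print(augment_with_code(x, 2).shape)   # (304,)
print(multi_head_forward(x, 2).shape)  # (60,)
```

The "retrain" approach needs no structural change: the same network is simply fine-tuned on emotion-specific data, updating all weights.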