Decoding Knowledge Transfer for Neural Text-to-Speech Training

2022 
Neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways. However, the exposure bias problem, which arises from the mismatch between the training and inference processes in autoregressive models, remains an issue. It often leads to performance degradation in the face of out-of-domain test data. To address this problem, we study a novel decoding knowledge transfer strategy and propose a multi-teacher knowledge distillation (MT-KD) network for the Tacotron2 TTS model. The idea is to pre-train two Tacotron2 TTS teacher models in teacher forcing and scheduled sampling modes, and to transfer the pre-trained knowledge to a student model that performs free-running decoding. We show that the MT-KD network provides an adequate platform for neural TTS training, where the student model learns to emulate the behaviors of the two teachers while minimizing the mismatch between training and run-time inference. Experiments on both Chinese and English data show that the MT-KD system consistently outperforms competitive baselines in terms of naturalness, robustness, and expressiveness for in-domain and out-of-domain test data. Furthermore, we show that knowledge distillation outperforms adversarial learning and data augmentation in addressing the exposure bias problem.
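To make the training setup concrete, the sketch below illustrates one plausible form of the multi-teacher distillation objective described in the abstract: a free-running student decoder is fit to the ground-truth mel spectrogram while also being pulled toward the outputs of the teacher-forcing and scheduled-sampling teachers. The function name, the use of L1 losses, and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of an MT-KD-style loss for Tacotron2 training.
# Assumes PyTorch; all names and weights below are hypothetical.
import torch
import torch.nn.functional as F

def mt_kd_loss(student_mel, tf_teacher_mel, ss_teacher_mel, target_mel,
               w_target=1.0, w_tf=0.5, w_ss=0.5):
    """Combine reconstruction and two distillation terms.

    student_mel    : mel frames from the student in free-running mode
    tf_teacher_mel : frames from the teacher trained with teacher forcing
    ss_teacher_mel : frames from the teacher trained with scheduled sampling
    target_mel     : ground-truth mel spectrogram
    """
    # Fit the ground-truth acoustic target.
    loss_target = F.l1_loss(student_mel, target_mel)
    # Emulate each pre-trained teacher; teachers are frozen (detached).
    loss_tf = F.l1_loss(student_mel, tf_teacher_mel.detach())
    loss_ss = F.l1_loss(student_mel, ss_teacher_mel.detach())
    return w_target * loss_target + w_tf * loss_tf + w_ss * loss_ss
```

In such a setup, the two teacher terms supply frame-level targets that the student can match even when its own free-running predictions drift, which is one way the training/inference mismatch of autoregressive decoding can be reduced.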