Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation

2020 
We propose using personalized speech synthesis and a neural language generator to synthesize content-relevant personalized speech for rapid speaker adaptation. The approach has two distinct benefits. First, it relieves the general data sparsity issue in rapid adaptation by making use of additional synthesized personalized speech. Second, it circumvents the obstacle of explicit labeling errors in unsupervised adaptation by converting the task into pseudo-supervised adaptation. In this setup, the labeling error is implicitly rendered as a less damaging speech distortion in the personalized synthesized speech. This results in a significant performance gain in rapid unsupervised speaker adaptation. We apply the proposed methodology to a speaker adaptation task in a state-of-the-art speech transcription system. With 1 minute (min) of adaptation data, our proposed approach yields 9.19% and 5.98% relative word error rate (WER) reduction for supervised and unsupervised adaptation, respectively, compared with the negligible gain obtained when adapting with only 1 min of original speech. With 10 min of adaptation data, it yields 12.53% and 7.89% relative WER reduction, respectively, doubling the gain of the baseline adaptation. The proposed approach is particularly suitable for unsupervised adaptation.
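The figures above are relative, not absolute, WER reductions. As a minimal sketch of that arithmetic, the snippet below uses the standard definition of relative reduction, (baseline WER - adapted WER) / baseline WER. The function names and the 10.0% baseline WER in the usage note are hypothetical and chosen only for illustration; the abstract does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are absolute WERs in percent (e.g. 10.0 for 10%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def adapted_wer_from_reduction(baseline_wer: float, reduction_percent: float) -> float:
    """Return the absolute WER implied by a given relative reduction."""
    return baseline_wer * (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical baseline WER; NOT a number from the paper.
    baseline = 10.0
    adapted = adapted_wer_from_reduction(baseline, 9.19)
    print(f"adapted WER: {adapted:.3f}%")
    print(f"relative reduction: {relative_wer_reduction(baseline, adapted):.2f}%")
```

Under this assumed baseline, a 9.19% relative reduction corresponds to less than one absolute WER point, which is why relative and absolute figures should not be confused when comparing systems.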