Factorized WaveNet for voice conversion with limited data

2021 
Abstract: WaveNet was introduced for waveform generation and produces high-quality text-to-speech synthesis, music generation, and voice conversion. However, it generally requires a large amount of training data, which limits its scope of application, e.g. in voice conversion. In this paper, we propose a factorized WaveNet for limited-data tasks. Specifically, we apply singular value decomposition (SVD) to the dilated convolution layers of WaveNet to reduce the number of parameters. By doing so, we reduce the data requirement for WaveNet training while maintaining similar network performance. We use voice conversion as a case study to validate the proposed idea. Two sets of experiments are conducted, in which WaveNet serves as a vocoder and as an integrated converter-vocoder, respectively. Experiments on the CMU-ARCTIC and CSTR-VCTK corpora show that the factorized WaveNet consistently outperforms its original WaveNet counterpart when trained on the same amount of data. We also apply SVD in the same way to the real-time neural vocoder Parallel WaveGAN for voice conversion and observe a similar improvement.
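The core idea of SVD-based factorization can be illustrated with a minimal numpy sketch. This is not the paper's implementation; it only shows the generic technique the abstract names: flatten a convolution weight to a matrix, truncate its SVD at rank r, and replace one layer with two smaller ones. The layer shape and rank below are hypothetical.

```python
import numpy as np

def factorize_weight(W, rank):
    """Low-rank SVD factorization of a dense weight matrix W (out x in).

    Replaces one layer W with two smaller layers A (out x rank) and
    B (rank x in), cutting parameters from out*in to rank*(out + in).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # out x rank, singular values folded in
    B = Vt[:rank, :]             # rank x in
    return A, B

# Hypothetical dilated-conv weight, flattened to
# (out_channels, in_channels * kernel_size): here 64 x (64 * 2).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64 * 2))
A, B = factorize_weight(W, rank=16)
print(A.shape, B.shape)          # (64, 16) (16, 128)
print(A.size + B.size, W.size)   # 3072 vs. 8192 parameters
```

With rank 16, the factorized pair stores 3072 parameters in place of the original 8192; at inference, the layer computes `A @ (B @ x)` instead of `W @ x`. The rank trades parameter count against reconstruction fidelity.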