Emotional Voice Conversion using Dual Supervised Adversarial Networks with Continuous Wavelet Transform F0 Features

2019 
In emotional voice conversion VC tasks, it is difficult to deal with a simple representation of fundamental frequency F0, which is the most important feature in emotional voice representation. In order to address this issue, we propose the adaptive scales continuous wavelet transform ADS-CWT method to systematically capture F0 features of different temporal levels, which can represent different prosodic aspects, ranging from micro-prosody to sentences. Moreover, in an emotional VC task, each dataset is paired with the labeled emotional voice and neutral voice, which can be regarded as a dual task. Owing to, first, dual supervised learning's ability to improve the training performances by using the leveraging probabilistic connection between the dual tasks to enhance the learning from labeled data and, second, generative adversarial networks’ GANs’ ability to mitigate the over-smoothing problem caused in the low-level data space when converting the acoustic features, we further present a novel training framework for emotional VC using GANs combined with dual supervised learning, named as dual supervised adversarial networks. In emotional VC experiments, we confirmed the high similarity performance of our method when using limited labeled data for emotional VC. Our method achieves good and consistent performance, in both objective and subjective evaluations.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    0
    References
    16
    Citations
    NaN
    KQI
    []