Exploring Cross-lingual Singing Voice Synthesis Using Speech Data

2021 
State-of-the-art singing voice synthesis (SVS) models can generate a natural singing voice for a target speaker, given his/her speaking or singing data in the same language. However, there may be challenging conditions where only speech data in a non-target language is available for the target speaker. In this paper, we present a cross-lingual SVS system that can synthesize an English speaker's singing voice in Mandarin from musical scores using only her speech data in English. The proposed cross-lingual SVS system consists of four parts: a BLSTM-based duration model, a pitch model, a cross-lingual acoustic model, and a neural vocoder. The acoustic model employs an encoder-decoder architecture conditioned on pitch, phoneme duration, speaker information, and language information. An adversarially trained speaker classifier is employed to discourage the text encodings from capturing speaker information. Objective evaluation and subjective listening tests demonstrate that the proposed cross-lingual SVS system can generate singing voices with decent naturalness and fair speaker similarity. We also find that adding singing data or multi-speaker monolingual speech data further improves generalization in pronunciation and pitch accuracy.
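
The adversarial speaker classifier can be pictured as a small network attached to the text encoder through a gradient-reversal layer, so that minimizing its classification loss pushes the encoder toward speaker-independent text encodings. The sketch below is a minimal PyTorch illustration of this idea; the gradient-reversal mechanism, layer sizes, and per-frame speaker labels are assumptions for illustration, since the abstract only states that the classifier is adversarially trained.

```python
# Minimal sketch of an adversarial speaker classifier on text encodings.
# The gradient-reversal layer (GRL) and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.scale * grad_output, None


class SpeakerClassifier(nn.Module):
    """Predicts speaker identity from (gradient-reversed) text encodings."""

    def __init__(self, enc_dim=256, n_speakers=2, grl_scale=1.0):
        super().__init__()
        self.grl_scale = grl_scale
        self.net = nn.Sequential(
            nn.Linear(enc_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_speakers),
        )

    def forward(self, text_encodings):
        # text_encodings: (batch, time, enc_dim) from the acoustic-model encoder
        reversed_enc = GradReverse.apply(text_encodings, self.grl_scale)
        return self.net(reversed_enc)  # per-frame speaker logits


# Usage sketch: the cross-entropy loss trains the classifier, while the reversed
# gradient discourages the encoder from capturing speaker information.
encoder_out = torch.randn(4, 100, 256, requires_grad=True)  # dummy encoder output
speaker_ids = torch.randint(0, 2, (4, 100))                 # dummy per-frame labels
clf = SpeakerClassifier()
logits = clf(encoder_out)
adv_loss = nn.CrossEntropyLoss()(logits.reshape(-1, 2), speaker_ids.reshape(-1))
adv_loss.backward()
```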