End-To-End Voice Conversion Via Cross-Modal Knowledge Distillation for Dysarthric Speech Reconstruction

2020 
Dysarthric speech reconstruction (DSR) is a challenging task due to the difficulty of repairing unstable prosody and correcting imprecise articulation. Inspired by the success of sequence-to-sequence (seq2seq) text-to-speech (TTS) synthesis and knowledge distillation (KD) techniques, this paper proposes a novel end-to-end voice conversion (VC) method for the reconstruction task. The proposed approach contains three components. First, a seq2seq-based TTS model is trained on transcribed normal speech. Second, with the text-encoder of this trained TTS system as the "teacher", a teacher-student framework is used for cross-modal KD: a speech-encoder is trained to extract appropriate linguistic representations from transcribed dysarthric speech. Third, the speech-encoder from the second component is combined with the attention module and decoder of the first component (the TTS model) to perform DSR, directly mapping dysarthric speech to its normal version. Experiments demonstrate that the proposed method can generate speech with high naturalness and intelligibility: human speech recognition comparisons between the reconstructed speech and the original dysarthric speech show absolute word error rate (WER) reductions of 35.4% and 48.7% for dysarthric speakers with low and very low speech intelligibility, respectively.
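The core of the second component is a cross-modal distillation objective: the student speech-encoder is trained so that its frame-level outputs match the linguistic representations produced by the frozen text-encoder teacher. The following is a minimal sketch of that idea, assuming a simple linear speech-encoder, an MSE distillation loss, and illustrative dimensions; none of these specifics (shapes, learning rate, linear model) come from the paper.

```python
import numpy as np

# Illustrative dimensions (hypothetical, not from the paper):
# T frames per utterance, d_in acoustic feature dim, d_emb linguistic embedding dim.
rng = np.random.default_rng(0)
T, d_in, d_emb = 50, 80, 16

# "Teacher": frozen text-encoder outputs for one utterance
# (the linguistic representations the student must imitate).
teacher_repr = rng.normal(size=(T, d_emb))

# "Student": here just a linear speech-encoder mapping acoustic
# frames into the teacher's representation space.
speech_feats = rng.normal(size=(T, d_in))
W = rng.normal(scale=0.01, size=(d_in, d_emb))

def kd_loss(W):
    """MSE distillation loss between student outputs and teacher representations."""
    diff = speech_feats @ W - teacher_repr
    return float((diff ** 2).mean())

# One gradient-descent step on the distillation loss
# (closed-form gradient of the MSE for the linear student).
lr = 1e-3
grad = 2.0 * speech_feats.T @ (speech_feats @ W - teacher_repr) / (T * d_emb)
loss_before = kd_loss(W)
W = W - lr * grad
loss_after = kd_loss(W)
print(loss_before, loss_after)
```

In the paper's setting the student would be a neural speech-encoder trained on many utterances, after which it replaces the text-encoder at inference time, feeding the TTS attention and decoder to synthesize the reconstructed speech.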