Towards Data Selection on TTS Data for Children’s Speech Recognition

2021 
Although great progress has been made on automatic speech recognition (ASR) systems, children’s speech recognition remains a challenging task. General ASR systems perform poorly on children’s speech because of the lack of corpora and the mismatch between children’s and adults’ speech. Efforts have been made to reduce this mismatch by applying normalization methods to generate modified adults’ speech for ASR training. However, modified adults’ data can reflect the characteristics of children’s speech only to a very limited extent. In this work, we adopt text-to-speech (TTS) data augmentation to improve the performance of children’s speech recognition systems. We find that the children’s TTS model generates speech of inconsistent quality due to children’s substandard pronunciations of phonemes, and that the ASR system suffers when trained with these additional synthesized data. To solve this problem, we propose data selection strategies for the TTS-augmented data, which substantially boost the effectiveness of the synthesized data for children’s ASR modeling. We show that the data selection strategy based on speaker embedding similarity achieves the best results: relative character error rate (CER) reductions of 14.0% and 14.7% on the child conversation and child reading test sets, respectively, compared to the baseline model trained on real data.
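
The abstract does not spell out how the speaker embedding similarity selection is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each synthesized utterance already has a speaker embedding, that a single reference embedding represents the target child speaker, and that utterances are kept when their cosine similarity to the reference meets a threshold. The names `cosine_similarity`, `select_by_speaker_similarity`, the toy 2-dimensional embeddings, and the threshold of 0.7 are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_by_speaker_similarity(synthesized, reference, threshold=0.7):
    """Return the ids of synthesized utterances whose embedding is close to the reference.

    synthesized: list of (utterance_id, embedding) pairs (assumed precomputed).
    reference:   embedding of the target speaker (assumption: one vector per speaker).
    threshold:   hypothetical cutoff; a real system would tune this on held-out data.
    """
    return [
        utt_id
        for utt_id, embedding in synthesized
        if cosine_similarity(embedding, reference) >= threshold
    ]


# Toy example with made-up 2-dimensional embeddings.
synthesized = [
    ("utt_a", [1.0, 0.0]),  # same direction as the reference
    ("utt_b", [0.0, 1.0]),  # orthogonal to the reference
    ("utt_c", [1.0, 1.0]),  # 45 degrees from the reference
]
reference = [1.0, 0.0]
selected = select_by_speaker_similarity(synthesized, reference, threshold=0.7)
```

In this toy setup the selected utterances would be added to the real training data while the rejected ones are discarded; whether the paper filters by threshold, by ranking, or by some other rule is not stated in the abstract.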