UT-Vocal Effort II: Analysis and constrained-lexicon recognition of whispered speech

This study focuses on acoustic variations in speech introduced by whispering and proposes several strategies to improve the robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences between neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated to speech recognition, several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed. The proposed neutral-trained system employing a redistributed filter bank and reduced features provides a 7.7% absolute word error rate (WER) reduction over the baseline system trained on neutral speech, and a 1.3% reduction over a baseline system with whisper-adapted acoustic models.