From Speech Signals to Semantics — Tagging Performance at Acoustic, Phonetic and Word Levels

2018 
Spoken language understanding (SLU) aims to decode the semantic information embedded in speech input. SLU decoding can be significantly degraded by a mismatch between the acoustic/language models used in training and in testing of a decoder. In this paper we investigate the semantic tagging performance of a bidirectional LSTM RNN (BLSTM-RNN) with inputs at the acoustic, phonetic, and word levels. It is tested on a crowdsourced spoken-dialog corpus of non-native speakers performing a job-interview task. Tagging performance improves successively from the low-level acoustic MFCC input, through the mid-level stochastic senone posteriorgram, to the high-level ASR-recognized word string, with corresponding tagging accuracies of 70.6%, 82.1%, and 85.1%, respectively. Fusing the scores of the three individual RNNs further improves the accuracy to 87.0%.
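As a rough illustration of the architecture the abstract describes, below is a minimal sketch, not the authors' implementation, of a BLSTM sequence tagger with late score fusion of several taggers' outputs. It assumes PyTorch; the class BLSTMTagger, the function fuse_scores, and all dimensions and fusion weights are hypothetical placeholders, and the abstract does not specify the paper's exact fusion scheme.

```python
# A minimal sketch (not the paper's code) of a BLSTM tagger plus late
# score fusion; all names, dimensions, and weights are illustrative.
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_tags):
        super().__init__()
        # Bidirectional LSTM reads the feature sequence in both directions.
        self.blstm = nn.LSTM(input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # Linear layer maps concatenated forward/backward states to tag scores.
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, x):                    # x: (batch, time, input_dim)
        h, _ = self.blstm(x)                 # h: (batch, time, 2*hidden_dim)
        return self.proj(h).log_softmax(-1)  # per-step tag log-probabilities

def fuse_scores(scores, weights):
    """Weighted sum of the taggers' log-probability scores (late fusion).
    Assumes the scores have already been aligned to the same tag positions."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical usage for one input level: 39-dim MFCC frames, 20 tags.
mfcc_tagger = BLSTMTagger(input_dim=39, hidden_dim=128, num_tags=20)
x = torch.randn(1, 200, 39)   # one utterance, 200 frames of 39-dim MFCCs
tag_scores = mfcc_tagger(x)   # (1, 200, 20) log-probabilities
```

In this sketch the fusion is a simple weighted sum of log-probabilities; since the three input levels (frames, senones, words) have different sequence lengths, a real system would first map each tagger's scores onto a common tag sequence before fusing.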