From Speech Signals to Semantics — Tagging Performance at Acoustic, Phonetic and Word Levels
2018
Spoken language understanding (SLU) aims to decode the semantic information embedded in a speech input. SLU decoding can be significantly degraded by a mismatch between the acoustic/language models used in training and testing a decoder. In this paper we investigate the semantic tagging performance of a bidirectional LSTM RNN (BLSTM-RNN) with input at the acoustic, phonetic and word levels. It is evaluated on a crowdsourced spoken dialog corpus recorded by non-native speakers in a job interview task. Tagging performance improves successively from low-level acoustic MFCC features, through mid-level stochastic senone posteriorgrams, to high-level ASR-recognized word strings, with corresponding tagging accuracies of 70.6%, 82.1% and 85.1%, respectively. Fusing the scores of the three individual RNNs further improves the accuracy to 87.0%.
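The abstract describes three BLSTM-RNN taggers, one per input level, whose scores are then fused. Below is a minimal PyTorch sketch of that setup, not the authors' implementation: all dimensions, the mean-pooling over time, and the averaging-based late fusion are illustrative assumptions not specified in the abstract.

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """BLSTM over a feature sequence, pooled to one semantic tag per utterance."""
    def __init__(self, input_dim, hidden_dim, num_tags):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, x):                # x: (batch, time, input_dim)
        h, _ = self.blstm(x)             # (batch, time, 2 * hidden_dim)
        h_pooled = h.mean(dim=1)         # pool over time (an assumption)
        return torch.log_softmax(self.out(h_pooled), dim=-1)

# One tagger per input level; all dimensions are hypothetical.
mfcc_net   = BLSTMTagger(input_dim=39,   hidden_dim=128, num_tags=20)  # MFCC frames
senone_net = BLSTMTagger(input_dim=1000, hidden_dim=128, num_tags=20)  # senone posteriorgrams
word_net   = BLSTMTagger(input_dim=300,  hidden_dim=128, num_tags=20)  # word embeddings

def fuse(scores):
    """Late score fusion: average per-tag log-posteriors, then pick the best tag."""
    return torch.stack(scores).mean(dim=0).argmax(dim=-1)  # scores: list of (batch, num_tags)
```

Each network would be trained on its own input stream; at test time the three log-posterior vectors for an utterance are averaged before the argmax, one simple way to realize the score fusion the abstract reports.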