Deep speaker conditioning for speech emotion recognition
2021
In this work, we explore the use of speaker conditioning sub-networks for speaker adaptation in a deep neural network (DNN) based speech emotion recognition (SER) system. We use a ResNet architecture trained on log spectrogram features, and augment this architecture with an auxiliary network providing speaker embeddings, which conditions multiple layers of the primary classification network on a single neutral speech sample of the target speaker. The whole system is trained end-to-end using a standard cross-entropy loss for utterance-level SER. Relative to the same architecture without the auxiliary embedding sub-network, this approach improves performance by 8.3% on IEMOCAP, and by 5.0% and 30.9% on the 2-class and 5-class SER tasks on FAU-AIBO, respectively.
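The abstract does not specify the exact conditioning operation, so the following is only a minimal sketch of one plausible mechanism: a FiLM-style feature-wise modulation, where the speaker embedding is projected to per-channel scale and shift parameters that are applied to an intermediate feature map of the classification network. All names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_condition(features, speaker_emb, W_scale, W_shift):
    """Condition a conv feature map on a speaker embedding.

    Hypothetical FiLM-style mechanism (assumption, not taken from
    the paper): the embedding is linearly projected to per-channel
    scale (gamma) and shift (beta) parameters.

    features:    (C, T, F) feature map (channels, time, frequency)
    speaker_emb: (D,) embedding from the auxiliary network
    W_scale, W_shift: (C, D) learned projection matrices
    """
    gamma = W_scale @ speaker_emb  # per-channel scale, shape (C,)
    beta = W_shift @ speaker_emb   # per-channel shift, shape (C,)
    return features * gamma[:, None, None] + beta[:, None, None]

# Toy shapes for illustration only.
C, T, F, D = 8, 16, 32, 4
features = rng.standard_normal((C, T, F))
emb = rng.standard_normal(D)          # embedding of one neutral sample
W_scale = rng.standard_normal((C, D))
W_shift = rng.standard_normal((C, D))

out = speaker_condition(features, emb, W_scale, W_shift)
print(out.shape)  # (8, 16, 32)
```

In an end-to-end setup as described, `W_scale` and `W_shift` (one pair per conditioned layer) would be trained jointly with both networks under the utterance-level cross-entropy loss.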