Deep speaker conditioning for speech emotion recognition

2021 
In this work, we explore the use of speaker conditioning sub-networks for speaker adaptation in a deep neural network (DNN) based speech emotion recognition (SER) system. We use a ResNet architecture trained on log spectrogram features, and augment this architecture with an auxiliary network providing speaker embeddings, which conditions multiple layers of the primary classification network on a single neutral speech sample of the target speaker. The whole system is trained end-to-end using a standard cross-entropy loss for utterance-level SER. Relative to the same architecture without the auxiliary embedding sub-network, we improve by 8.3% on IEMOCAP, and by 5.0% and 30.9% on the 2-class and 5-class SER tasks on FAU-AIBO, respectively.
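The abstract leaves the conditioning mechanism unspecified, but the general pattern it describes can be sketched with NumPy: an auxiliary network maps one neutral utterance of the target speaker to an embedding, and that embedding modulates feature maps at multiple layers of the primary classifier. The sketch below uses FiLM-style per-channel scale-and-shift modulation as an illustrative assumption; all function names, shapes, and the random linear projections are hypothetical, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_embedding(neutral_spectrogram, dim=64):
    """Toy stand-in for the auxiliary embedding sub-network: mean-pool
    the log spectrogram over time, then apply a linear projection."""
    pooled = neutral_spectrogram.mean(axis=1)            # (n_mels,)
    W = rng.standard_normal((dim, pooled.shape[0])) * 0.01
    return W @ pooled                                    # (dim,)

def condition_layer(features, emb, n_channels):
    """Condition one block's output (C, H, W) on the speaker embedding
    via an assumed per-channel scale (gamma) and shift (beta)."""
    W_g = rng.standard_normal((n_channels, emb.shape[0])) * 0.01
    W_b = rng.standard_normal((n_channels, emb.shape[0])) * 0.01
    gamma = 1.0 + W_g @ emb                              # per-channel scale
    beta = W_b @ emb                                     # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

# Example: condition a 32-channel feature map on one neutral sample.
neutral = rng.standard_normal((40, 100))   # (n_mels, frames) log spectrogram
emb = speaker_embedding(neutral)
feats = rng.standard_normal((32, 10, 25))  # one conditioned layer's output
out = condition_layer(feats, emb, n_channels=32)
print(out.shape)                           # (32, 10, 25)
```

In the end-to-end system described in the abstract, the projection weights would be learned jointly with the classifier under the cross-entropy loss, and several layers of the ResNet would each receive their own modulation from the shared embedding.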