Exploration of an Independent Training Framework for Speech Emotion Recognition

2020 
Speech emotion recognition (SER) plays an indispensable role in human-computer interaction tasks, where the ultimate performance is determined by the features used, such as empirically learned features (ELFs) and automatically learned features (ALFs). Although fusing ELFs and ALFs can provide complementary information for SER, training them jointly under a single softmax layer is inappropriate, because ELFs and ALFs perform differently for emotion recognition. Based on this consideration, this paper proposes an independent training framework that fully exploits the complementary advantages of human knowledge and the powerful learning ability of deep learning models. Specifically, we first feed Mel-frequency cepstral coefficient (MFCC) features and openSMILE features into a pair of independent models, composed of an attention-based convolutional long short-term memory neural network and a fully connected neural network, respectively. We then design a feedback mechanism for each model to extract ALFs and ELFs independently, where hard example mining and re-training with a hard example loss focus feature extraction on hard examples during training. Finally, a classifier distinguishes emotions using the independently extracted ALFs and ELFs together. Extensive experiments on three public speech emotion datasets (IEMOCAP, EMODB, and CASIA) show that the proposed independent training framework outperforms conventional feature fusion methods.
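The sketch below illustrates the two-branch architecture the abstract describes: an attention-based convolutional LSTM learning ALFs from MFCC frames, a fully connected network refining openSMILE statistics into ELFs, each with its own emotion head for independent training, and a final classifier over the concatenated features. It is a minimal PyTorch sketch under assumed hyperparameters (layer sizes, feature dimensions, number of emotions) and assumed class names; the feedback mechanism, hard example mining, and hard example loss from the paper are omitted.

```python
# Minimal two-branch SER sketch (assumed configuration, not the authors' exact model).
import torch
import torch.nn as nn


class ALFBranch(nn.Module):
    """Attention-based convolutional LSTM that learns ALFs from MFCC frames."""

    def __init__(self, n_mfcc=40, hidden=128, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # frame-level attention scores
        self.proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, mfcc):                       # mfcc: (batch, frames, n_mfcc)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(x)                        # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over frames
        pooled = (w * h).sum(dim=1)                # attention-weighted utterance vector
        return self.proj(pooled)                   # automatically learned features


class ELFBranch(nn.Module):
    """Fully connected network that refines openSMILE statistics into ELFs."""

    def __init__(self, n_opensmile=384, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_opensmile, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.ReLU(),
        )

    def forward(self, opensmile):                  # opensmile: (batch, n_opensmile)
        return self.net(opensmile)


class IndependentSER(nn.Module):
    """Each branch has its own emotion head (independent training); the final
    classifier consumes the concatenated ALF/ELF representations."""

    def __init__(self, n_emotions=4, feat_dim=64):
        super().__init__()
        self.alf = ALFBranch(feat_dim=feat_dim)
        self.elf = ELFBranch(feat_dim=feat_dim)
        self.alf_head = nn.Linear(feat_dim, n_emotions)
        self.elf_head = nn.Linear(feat_dim, n_emotions)
        self.classifier = nn.Linear(2 * feat_dim, n_emotions)

    def forward(self, mfcc, opensmile):
        a = self.alf(mfcc)
        e = self.elf(opensmile)
        fused = self.classifier(torch.cat([a, e], dim=-1))
        return fused, self.alf_head(a), self.elf_head(e)


if __name__ == "__main__":
    model = IndependentSER()
    mfcc = torch.randn(8, 300, 40)   # 8 utterances, 300 frames, 40 MFCCs
    osm = torch.randn(8, 384)        # 8 openSMILE feature vectors
    fused_logits, alf_logits, elf_logits = model(mfcc, osm)
    print(fused_logits.shape)        # torch.Size([8, 4])
```

In this reading, the per-branch heads allow each feature extractor to be optimized on its own objective before the shared classifier combines the two feature sets, which is the core difference from fusing both feature types under one softmax layer.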