Active Learning for Speech Emotion Recognition Using Deep Neural Network

2019 
Deep neural networks (DNNs) have consistently pushed state-of-the-art performance in many fields, including speech emotion recognition. However, DNN-based solutions require vast amounts of labeled data for training. In speech emotion recognition, the cost and time needed to annotate data with emotional labels can be prohibitive. The available corpora normally contain a few thousand recordings collected from a limited number of speakers. As a result, models trained on such corpora fail to generalize to samples from new domains. This study explores practical solutions for training DNNs for speech emotion recognition with limited resources by using active learning (AL). We assume that unlabeled data from a new domain are available and that we have resources to select a limited number of recordings to be annotated with emotional labels. We actively select samples using greedy sampling (GS) and uncertainty-based methods, evaluating the performance on regression problems where the goal is to predict scores for arousal and valence. We show that the use of active learning leads to competitive performance with limited training data.
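To illustrate the kind of selection strategy the abstract refers to, the sketch below implements a generic input-space greedy sampling (GS) step for active learning: from an unlabeled pool of acoustic feature vectors, it repeatedly picks the sample farthest from everything labeled so far and sends it for annotation. The feature dimensions, distance metric, and function names here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def greedy_sampling(pool_features, labeled_features, n_select):
    """Greedy sampling (input-space variant, a common GS formulation):
    iteratively select the pool sample whose distance to its nearest
    already-labeled (or already-selected) sample is largest."""
    pool = np.asarray(pool_features, dtype=float)
    labeled = np.asarray(labeled_features, dtype=float)
    # Distance from each pool sample to its nearest labeled sample.
    dists = np.min(
        np.linalg.norm(pool[:, None, :] - labeled[None, :, :], axis=-1), axis=1
    )
    selected = []
    for _ in range(n_select):
        idx = int(np.argmax(dists))   # farthest from the labeled/selected set
        selected.append(idx)
        dists[idx] = -np.inf          # never pick the same sample twice
        # Newly selected sample now also counts as "labeled": update distances.
        new_d = np.linalg.norm(pool - pool[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected

if __name__ == "__main__":
    # Hypothetical usage: choose 50 recordings from an unlabeled pool of
    # 1000 utterance-level acoustic feature vectors for emotional annotation.
    rng = np.random.default_rng(0)
    pool_feats = rng.normal(size=(1000, 88))   # e.g., eGeMAPS-sized features (assumed)
    seed_feats = rng.normal(size=(20, 88))     # small labeled seed set
    print(greedy_sampling(pool_feats, seed_feats, n_select=50)[:10])
```

Uncertainty-based selection, the other family of methods mentioned above, would instead rank pool samples by a model-derived uncertainty score (for example, prediction variance for the arousal and valence regressors) rather than by distance in feature space.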