Training universal background models with restricted data for speech emotion recognition

2021 
Speech emotion recognition (SER) is an important research topic that relies heavily on emotional data. Although SER has seen recent advances, the Universal Background Model (UBM), a standard concept borrowed from the neighbouring field of speaker recognition, remains the base module for newly developed methods such as Joint Factor Analysis. In theory, a UBM is a Gaussian mixture model trained on an extensive and representative set of speech samples covering the different target classes, so that it captures general feature characteristics. Obtaining a large amount of emotional data to train a UBM is challenging, further complicated by the cost of annotation and the ambiguity of the resulting labels. In addition, the model depends on its training data. In this paper, we make a preliminary exploration of a new approach: training UBMs, which we call restricted UBMs, on a small amount of speech that may even differ from the training data. Experiments show that this approach yields a domain-independent UBM from which an acoustic model transferable across datasets can be built. Four standard benchmark speech databases in different languages are used for the experimental evaluation. The results show that the proposed model outperforms existing state-of-the-art baselines. We also apply this approach to emotional speaker recognition.
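To make the UBM idea concrete, the following is a minimal sketch (not the paper's implementation) of fitting a GMM-based UBM on a small pool of frame-level features and scoring an utterance against it. The feature dimensions, component count, and synthetic data are illustrative assumptions; a real system would use MFCC frames extracted from speech.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative sketch: a "restricted" UBM is a GMM fit on a small,
# possibly out-of-domain pool of frame-level acoustic features.
rng = np.random.default_rng(0)

# Stand-in for MFCC frames pooled from a small amount of speech:
# 500 frames x 13 coefficients (real systems use far more data).
frames = rng.normal(size=(500, 13))

# Diagonal-covariance GMM, the usual choice for UBMs; 8 components
# is an arbitrary small number for illustration.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(frames)

# An utterance can then be scored against the UBM, e.g. via its
# average per-frame log-likelihood; downstream methods such as
# Joint Factor Analysis instead adapt the UBM to each utterance.
utterance = rng.normal(size=(120, 13))
score = ubm.score(utterance)  # mean log-likelihood per frame
print(score)
```

The key point of the restricted-UBM approach is that `frames` need not come from the target emotion corpus: the UBM only has to capture general acoustic structure, which is why a small, even mismatched, speech pool can suffice.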