Deep Graph Fusion Based Multimodal Evoked Expressions From Large-Scale Videos

2021 
Multiple sources of noise can impair a model's ability to approximate the ground truth in emotion recognition from in-the-wild input signals with high variation. Numerous studies have investigated directly identifying a character's affective expressions from face, speech, and text. However, few studies predict a viewer's emotions from the content they watch. In this paper, we therefore propose a hybrid fusion model, termed deep graph fusion, that predicts viewers' evoked expressions from videos by combining visual and audio representations. The proposed system comprises four stages. First, we extract visual and audio features for each 30-second segment using pre-trained CNN models to capture their salient representations. Second, we structure these features as graphs and apply graph convolutional networks to perform node embedding. Third, we introduce several fusion modules to combine the graph representations of the visual and audio branches. Finally, the fused features are used to estimate evoked scores for all emotion classes via a sigmoid activation. In addition, we present a semantic embedding loss that captures the semantic meaning of textual emotion labels to improve overall performance. We evaluate the proposed method on the Evoked Expressions from Videos (EEV) dataset, on both the validation and test sets. The experimental results demonstrate that the proposed algorithm outperforms conventional models.
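
The abstract describes the four-stage pipeline only at a high level. The sketch below illustrates the general idea in PyTorch, not the authors' implementation: it assumes concatenation as the fusion operator, a fully connected segment graph, 2048-dimensional visual and 128-dimensional audio CNN features, and 15 emotion classes. All module names (SimpleGCNLayer, GraphFusionHead), dimensions, and the adjacency construction are hypothetical choices made for illustration.

# Minimal sketch (not the authors' code) of a graph-fusion pipeline for
# evoked-expression prediction. Fusion choice, feature sizes, graph
# construction, and class count are assumptions; the paper does not
# specify them at this level of detail.
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        # feats: (num_nodes, in_dim), adj: (num_nodes, num_nodes)
        adj_hat = adj + torch.eye(adj.size(0), device=adj.device)  # self-loops
        deg_inv_sqrt = adj_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * adj_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ feats))


class GraphFusionHead(nn.Module):
    """Fuse visual and audio node embeddings and predict per-class evoked scores."""

    def __init__(self, vis_dim=2048, aud_dim=128, hidden=256, num_classes=15):
        super().__init__()
        self.vis_gcn = SimpleGCNLayer(vis_dim, hidden)
        self.aud_gcn = SimpleGCNLayer(aud_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, num_classes)  # concatenation fusion

    def forward(self, vis_feats, vis_adj, aud_feats, aud_adj):
        # Each node corresponds to one video segment; pre-trained CNN
        # backbones would produce vis_feats / aud_feats upstream of this module.
        vis_nodes = self.vis_gcn(vis_feats, vis_adj)   # (num_segments, hidden)
        aud_nodes = self.aud_gcn(aud_feats, aud_adj)   # (num_segments, hidden)
        fused = torch.cat([vis_nodes, aud_nodes], dim=-1)
        return torch.sigmoid(self.classifier(fused))   # evoked scores in [0, 1]


if __name__ == "__main__":
    num_segments = 10
    vis = torch.randn(num_segments, 2048)
    aud = torch.randn(num_segments, 128)
    adj = torch.ones(num_segments, num_segments)  # placeholder: fully connected graph
    model = GraphFusionHead()
    scores = model(vis, adj, aud, adj)
    print(scores.shape)  # torch.Size([10, 15])

In practice the adjacency could instead encode temporal proximity or feature similarity between segments, and the concatenation could be replaced by the paper's other fusion modules; the sketch only fixes the overall data flow from segment features to per-class sigmoid scores.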