Transformer-based Label Set Generation for Multi-modal Multi-label Emotion Detection

2020 
Multi-modal utterance-level emotion detection has been a hot research topic in both the multi-modal analysis and natural language processing communities. Unlike traditional single-label multi-modal sentiment analysis, multi-modal emotion detection is naturally a multi-label problem, since an utterance often conveys multiple emotions. Existing studies typically focus on multi-modal fusion alone and transform multi-label emotion classification into multiple independent binary classification problems. As a result, they largely ignore two kinds of important dependency information: (1) modality-to-label dependency, where different emotions can be inferred from different modalities, i.e., each modality contributes differently to each potential emotion; and (2) label-to-label dependency, where some emotions are more likely to co-occur than conflicting ones. To model both kinds of dependency simultaneously, we propose a unified approach, the multi-modal emotion set generation network (MESGN), which generates an emotion set for each utterance. Specifically, we first employ a cross-modal transformer encoder to capture cross-modal interactions among different modalities, followed by a standard transformer encoder to capture temporal information within each modality-specific sequence given those interactions. We then design a transformer-based discriminative decoding module, equipped with modality-to-label attention, to handle the modality-to-label dependency. Meanwhile, we employ a reinforced decoding algorithm with self-critical learning to handle the label-to-label dependency. Finally, we validate the proposed MESGN architecture on a multi-modal dataset in both word-level aligned and unaligned settings. Detailed experiments show that MESGN effectively improves the performance of multi-modal multi-label emotion detection.
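To make the cross-modal encoding step concrete, below is a minimal sketch (not the authors' code) of one cross-modal transformer encoder layer, in which a target modality sequence attends to a source modality sequence, so that, for example, the text stream can be enriched with audio or visual context before the per-modality temporal transformer. All class names, dimensions, and hyperparameters here are illustrative assumptions; the abstract only specifies the high-level design.

import torch
import torch.nn as nn

class CrossModalEncoderLayer(nn.Module):
    """One cross-modal attention layer followed by a feed-forward block.

    A hypothetical sketch of the cross-modal interaction described in the
    abstract; the paper's exact architecture may differ.
    """
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_tgt, d_model), e.g. the text sequence
        # source: (batch, T_src, d_model), e.g. the audio or visual sequence
        attn_out, _ = self.cross_attn(query=target, key=source, value=source)
        x = self.norm1(target + attn_out)   # residual connection + layer norm
        return self.norm2(x + self.ff(x))   # position-wise feed-forward block

# Usage: fuse audio context into the text stream (shapes are arbitrary).
text = torch.randn(8, 50, 256)    # (batch, text length, d_model)
audio = torch.randn(8, 200, 256)  # (batch, audio length, d_model)
fused = CrossModalEncoderLayer()(text, audio)  # -> (8, 50, 256)

Note that cross-modal attention keeps the target sequence length while drawing keys and values from the source modality, which is what lets sequences of different lengths (e.g., unaligned text and audio) interact without word-level alignment.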