Deep Multimodal Learning for Emotion Recognition in Spoken Language

Yue Gu, Shuhong Chen, Ivan Marsic

2018

In this paper, we present a novel deep multimodal framework to predict human emotions based on sentence-level spoken language. Our architecture has two distinctive characteristics. First, it extracts high-level features from both text and audio via a hybrid deep multimodal structure, which considers the spatial information from text, the temporal information from audio, and high-level associations from low-level handcrafted features. Second, we fuse all features using a three-layer deep neural network to learn the correlations across modalities, and we train the feature extraction and fusion modules together, allowing optimal global fine-tuning of the entire structure. We evaluated the proposed framework on the IEMOCAP dataset. Our results show promising performance, achieving a weighted accuracy of 60.4% for five emotion categories.
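The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature dimensions, the concatenation of modality features, the ReLU activations, the softmax output, and all names (`fuse`, `dense`, `init_layer`) are assumptions made for illustration. Only the three-layer fusion network and the five emotion categories come from the abstract. The weights are random, so the output is an untrained forward pass.

```python
import math
import random


def dense(x, weights, biases, relu=True):
    """Fully connected layer: y = W x + b, optionally followed by ReLU."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse(text_feat, audio_feat, handcrafted_feat, layers):
    """Concatenate per-modality features and pass them through the layers.

    Concatenation is an assumed way of combining modalities; the last
    layer produces logits that are turned into class probabilities.
    """
    x = list(text_feat) + list(audio_feat) + list(handcrafted_feat)
    for i, (weights, biases) in enumerate(layers):
        is_last = i == len(layers) - 1
        x = dense(x, weights, biases, relu=not is_last)
    return softmax(x)


# Hypothetical sizes; only NUM_EMOTIONS = 5 is taken from the abstract.
TEXT_DIM, AUDIO_DIM, HAND_DIM, HIDDEN, NUM_EMOTIONS = 8, 6, 4, 16, 5

rng = random.Random(0)
layers = [
    init_layer(TEXT_DIM + AUDIO_DIM + HAND_DIM, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, NUM_EMOTIONS, rng),
]

# Stand-ins for the outputs of the text, audio, and handcrafted branches.
text_feat = [rng.random() for _ in range(TEXT_DIM)]
audio_feat = [rng.random() for _ in range(AUDIO_DIM)]
handcrafted_feat = [rng.random() for _ in range(HAND_DIM)]

probs = fuse(text_feat, audio_feat, handcrafted_feat, layers)
```

In a jointly trained system, the gradient of the classification loss would flow back through these layers into the feature extractors, which is what the abstract refers to as global fine-tuning; the sketch above covers only the forward pass.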
