Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking

2018 
We investigate the importance of human-centered visual cues for predicting the popularity of a public lecture. We construct a large database of more than 1800 TED talk videos and leverage the corresponding (online) viewers' ratings from YouTube as a measure of the popularity of the TED talks. Visual cues related to facial and physical appearance, facial expressions, and pose variations are learned using convolutional neural networks (CNNs) connected to an attention-based long short-term memory (LSTM) network to predict video popularity. The proposed overall network is end-to-end trainable, and achieves state-of-the-art prediction accuracy, indicating that the visual cues alone contain highly predictive information about the popularity of a talk. We also demonstrate qualitatively that the network learns a human-like attention mechanism, which is particularly useful for interpretability, i.e., how attention varies with time and across different visual cues as a function of their relative importance.
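To make the described pipeline concrete, the following is a minimal PyTorch sketch of a multichannel attention network of this kind: per-channel sequences of precomputed CNN features (e.g., facial appearance, expression, pose) are each processed by an LSTM, summarized with temporal attention, and fused with channel-level attention before a regression head predicts popularity. All names, feature dimensions, and the specific fusion scheme here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultichannelAttentionLSTM(nn.Module):
    """Sketch (assumed design): one LSTM per visual channel over
    precomputed CNN features, soft attention over time within each
    channel, then attention over channels, then a popularity score."""

    def __init__(self, feat_dims, hidden_dim=128):
        super().__init__()
        # one LSTM and one temporal-attention scorer per channel
        self.lstms = nn.ModuleList(
            nn.LSTM(d, hidden_dim, batch_first=True) for d in feat_dims
        )
        self.temporal_attn = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in feat_dims
        )
        # attention weights across channels (relative cue importance)
        self.channel_attn = nn.Linear(hidden_dim, 1)
        self.head = nn.Linear(hidden_dim, 1)  # popularity regressor

    def forward(self, channels):
        # channels: list of (batch, time, feat_dim) tensors, one per cue
        summaries = []
        for x, lstm, attn in zip(channels, self.lstms, self.temporal_attn):
            h, _ = lstm(x)                         # (B, T, H)
            a = torch.softmax(attn(h), dim=1)      # temporal weights (B, T, 1)
            summaries.append((a * h).sum(dim=1))   # attended summary (B, H)
        s = torch.stack(summaries, dim=1)          # (B, C, H)
        c = torch.softmax(self.channel_attn(s), dim=1)  # channel weights (B, C, 1)
        fused = (c * s).sum(dim=1)                 # (B, H)
        return self.head(fused).squeeze(-1)        # predicted popularity (B,)

# Usage with hypothetical feature dimensions per channel:
model = MultichannelAttentionLSTM(feat_dims=[512, 256, 128])
face = torch.randn(4, 100, 512)   # facial-appearance CNN features
expr = torch.randn(4, 100, 256)   # facial-expression features
pose = torch.randn(4, 100, 128)   # pose features
scores = model([face, expr, pose])  # shape (4,)
```

The learned temporal and channel attention weights (`a` and `c` above) are what would be inspected for the interpretability analysis the abstract mentions, i.e., how attention shifts over time and across cues.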