Group-Level Focus of Visual Attention for Improved Next Speaker Prediction

2021 
In this work we address the Next Speaker Prediction sub-challenge of the ACM '21 MultiMediate Grand Challenge. This challenge poses the problem of turn-taking prediction in physically situated multiparty interaction. Solving this problem is essential for enabling fluent, real-time multiparty human-machine interaction, and it is made more difficult by the need for a robust solution that performs effectively across a wide variety of settings and contexts. Prior work has shown that current state-of-the-art methods rely on machine learning approaches that do not generalize well to new settings and feature distributions. To address this problem, we propose the use of group-level focus of visual attention as additional information. We show that a simple combination of group-level focus of visual attention features and publicly available audio-video synchronizer models is competitive with state-of-the-art methods fine-tuned for the challenge dataset.
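The abstract describes the method only at a high level: group-level focus-of-attention features (who in the group is looking at whom) are combined with confidence scores from an off-the-shelf audio-video synchronizer (e.g., a SyncNet-style model). Below is a minimal late-fusion sketch of that idea; the input representation, the `attention_counts` and `predict_next_speaker` names, and the fixed `alpha` weighting are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np


def attention_counts(gaze_targets, num_people):
    """For each person i, count how many other participants are looking at i.

    gaze_targets[j] is the index of the person that participant j is
    looking at, or -1 if they are looking elsewhere (e.g., at the table).
    """
    counts = np.zeros(num_people)
    for j, target in enumerate(gaze_targets):
        if 0 <= target < num_people and target != j:
            counts[target] += 1
    return counts


def predict_next_speaker(gaze_targets, sync_scores, alpha=0.5):
    """Score each candidate as a convex combination of normalized group
    attention and audio-visual sync confidence; return the argmax.

    sync_scores[i] is assumed to be a SyncNet-style confidence that
    candidate i's lip motion matches the current audio.
    """
    n = len(sync_scores)
    att = attention_counts(np.asarray(gaze_targets), n)
    att = att / att.sum() if att.sum() > 0 else att
    sync = np.asarray(sync_scores, dtype=float)
    sync = sync / sync.sum() if sync.sum() > 0 else sync
    scores = alpha * att + (1 - alpha) * sync
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    # Four participants; persons 1, 2, and 3 all look at person 0,
    # whose lip motion also best matches the audio.
    gaze = [2, 0, 0, 0]          # gaze[j]: whom participant j looks at
    sync = [0.9, 0.2, 0.1, 0.3]  # hypothetical sync confidences
    speaker, scores = predict_next_speaker(gaze, sync)
    print(f"predicted next speaker: person {speaker}, scores: {scores}")
```

Because both cues are simple, per-frame quantities, a fusion of this kind needs no dataset-specific fine-tuning, which is consistent with the generalization argument made in the abstract.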