Integrating Historical States and Co-attention Mechanism for Visual Dialog

2021 
Visual dialog is a typical multi-modal task that involves both vision and language, and it currently faces two major difficulties. In this paper, we propose Integrating Historical States and Co-attention (HSCA) for visual dialog to address them. HSCA includes two main modules, Co-ATT and MATCH. The Co-ATT module guides the image with questions and answers at an early stage to obtain more specific objects. It tackles the first difficulty, the temporal sequence issue in historical information, which may affect the precision of answers to multi-round questions. The MATCH module takes a question containing pronouns and retrieves the best-matching historical information block. It overcomes the second difficulty, the visual reference problem, which requires resolving pronouns that refer to unknowns in the text and then locating the corresponding objects in the given image. We evaluate our model quantitatively and qualitatively on VisDial v1.0 and also carry out ablation studies. The experimental results demonstrate that HSCA outperforms the state-of-the-art methods in many aspects.
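As a rough illustration of the retrieval idea behind the MATCH module, the sketch below picks the history block whose embedding is most similar to the question embedding. It is not the HSCA implementation: the abstract does not specify the scoring function, so cosine similarity, the function names `cosine` and `best_matching_block`, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_matching_block(question_vec, history_vecs):
    """Return (index, scores) for the history block most similar to the question.

    Hypothetical helper: cosine similarity is an assumed scoring function,
    not the one used by the paper.
    """
    if not history_vecs:
        raise ValueError("history_vecs must contain at least one block")
    scores = [cosine(question_vec, h) for h in history_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings (made up): one question and three earlier dialog rounds.
question = [0.9, 0.1, 0.0]
history = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.1],
    [0.1, 0.2, 0.9],
]

index, scores = best_matching_block(question, history)
print(index)
```

In this toy setup the second history block (index 1) points in nearly the same direction as the question, so it is the one retrieved. A real model would learn the embeddings and would likely use a trained attention or matching layer in place of plain cosine similarity.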