Fine-Grained Unbalanced Interaction Network for Visual Question Answering

2021 
Learning an effective interaction mechanism is important for Visual Question Answering (VQA), which requires understanding both the visual content of images and the textual content of questions. Existing approaches model both inter-modal and intra-modal interactions but neglect the irrelevant information those interactions carry. In this paper, we propose a novel Fine-grained Unbalanced Interaction Network (FUIN) that adaptively captures the most useful information from interactions. It contains a parallel interaction module to model two-way inter-modal interactions and a fine-grained adaptive activation module that activates the interaction for each component according to its specific context. Experimental results on the benchmark VQA-v2 dataset demonstrate that FUIN achieves state-of-the-art VQA performance, with an overall accuracy of 71.14% on the test-std set.
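The abstract names two mechanisms: a parallel interaction module that models two-way (question-to-image and image-to-question) interactions, and a fine-grained module that adaptively activates each component's interaction from its own context. The sketch below is not the authors' implementation; it is a minimal illustration of that design, assuming standard cross-attention for the two-way interaction and a per-component sigmoid gate for the adaptive activation. All class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ParallelInteraction(nn.Module):
    """Sketch of a two-way interaction block with fine-grained gating."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Two-way inter-modal attention: each modality attends to the other.
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fine-grained gates: one scalar in [0, 1] per component
        # (word or region), computed from that component's own context.
        self.gate_q = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, q: torch.Tensor, v: torch.Tensor):
        # q: (B, Lq, dim) question word features
        # v: (B, Lv, dim) image region features
        q_ctx, _ = self.q2v(q, v, v)  # words attend to image regions
        v_ctx, _ = self.v2q(v, q, q)  # regions attend to question words
        # Gate each component's interaction by its [self, context] pair,
        # so irrelevant interactions are suppressed rather than passed on.
        gq = self.gate_q(torch.cat([q, q_ctx], dim=-1))
        gv = self.gate_v(torch.cat([v, v_ctx], dim=-1))
        return q + gq * q_ctx, v + gv * v_ctx

# Usage on dummy features: 14 question words, 36 image regions, dim 512.
block = ParallelInteraction(dim=512)
q = torch.randn(2, 14, 512)
v = torch.randn(2, 36, 512)
q_out, v_out = block(q, v)
print(q_out.shape, v_out.shape)  # (2, 14, 512) and (2, 36, 512)
```

The gate gives each word and region its own activation strength, which is one plausible reading of "fine-grained unbalanced": the two directions of interaction need not contribute equally for every component.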