Contrastive Learning Framework by Maximizing Mutual Information for Visual Question Answering

2021 
Visual Question Answering (VQA) is a challenging research area that requires joint understanding of image content and language to answer questions about a given image. Existing VQA models have devoted considerable effort to improving image understanding and have achieved good results. However, these models ignore the relationship between the image and its corresponding question, which is strongly correlated at data-collection time (i.e., the question an annotator asks should be consistent with the image shown). To capture the essential shared features between the image and the question, we propose a new VQA model based on contrastive learning that maximizes mutual information. The core idea of our model is to maximize the mutual information between question features and their corresponding image features, and to minimize the mutual information between question features and irrelevant image features, thereby improving the model's image understanding and question understanding simultaneously. The experimental results indicate that the feature representations learned by our model are more representative, and that our model outperforms the baseline on the VQA v1.0 and VQA v2.0 datasets.
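The sketch below illustrates one common way such an objective can be realized: an InfoNCE-style contrastive loss between question and image features, in which the matched image in a batch is the positive and all other images serve as irrelevant negatives. This is a minimal illustration assuming a PyTorch implementation; the function name, temperature value, and weighting term lambda_mi are illustrative and not the paper's exact formulation.

```python
# Illustrative sketch (not the authors' released code): an InfoNCE-style
# contrastive objective between question and image features. Maximizing this
# lower bound on mutual information pulls matched (image, question) pairs
# together and pushes mismatched pairs apart within a batch.
import torch
import torch.nn.functional as F

def contrastive_mi_loss(img_feats, ques_feats, temperature=0.07):
    """img_feats, ques_feats: (batch, dim) tensors; row i is a matched pair."""
    img = F.normalize(img_feats, dim=-1)
    ques = F.normalize(ques_feats, dim=-1)
    # Similarity of every question to every image in the batch.
    logits = ques @ img.t() / temperature            # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy over the batch: the matched image is the positive,
    # all other images in the batch act as irrelevant negatives.
    return F.cross_entropy(logits, targets)

# Hypothetical usage: add the term to the standard VQA answer-classification loss.
# loss = vqa_answer_loss + lambda_mi * contrastive_mi_loss(v, q)
```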