Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery

Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Mohamed Lamine Mekhalfi, Mansour Abdulaziz Al Zuair, Farid Melgani


2022

Recently, vision-language models based on transformers have been gaining popularity for joint modeling of visual and textual modalities. In particular, they show impressive results when transferred to several downstream tasks such as zero-shot and few-shot classification. In this article, we propose a visual question answering (VQA) approach for remote sensing images based on these models. The VQA task attempts to provide answers to image-related questions. Although VQA has gained popularity in computer vision, it is not yet widespread in remote sensing. First, we use the contrastive language-image pretraining (CLIP) network to embed the image patches and question words into sequences of visual and textual representations. Then, we learn attention mechanisms to capture the intradependencies within each of these representations and the interdependencies between them. Afterward, we generate the final answer by averaging the predictions of two classifiers mounted on top of the resulting contextual representations. In the experiments, we study the performance of the proposed approach on two datasets acquired with Sentinel-2 and aerial sensors. In particular, we demonstrate that our approach can achieve better results with a reduced training set size compared with recent state-of-the-art methods.
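
The pipeline described in the abstract (embed both modalities, apply attention within and between them, then average two classifiers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It rests on several assumptions: random arrays stand in for the CLIP patch and word embeddings, attention is single-head with no learned projections, each classifier is a single linear layer applied to a mean-pooled sequence, and one classifier is assigned to each modality. All names (`attention`, `answer_probabilities`, `w_img`, `w_txt`) and all sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return weights @ keys_values


def answer_probabilities(visual, textual, w_img, w_txt):
    """Average the predictions of two classifiers over contextual representations."""
    # Intradependencies: self-attention within each modality (residual connection).
    v = visual + attention(visual, visual)
    t = textual + attention(textual, textual)
    # Interdependencies: cross-attention between the two modalities.
    v_ctx = v + attention(v, t)
    t_ctx = t + attention(t, v)
    # Assumption: each classifier is one linear layer on a mean-pooled sequence.
    p_img = softmax(v_ctx.mean(axis=0) @ w_img)
    p_txt = softmax(t_ctx.mean(axis=0) @ w_txt)
    return 0.5 * (p_img + p_txt)


# Hypothetical sizes; random features stand in for CLIP embeddings.
rng = np.random.default_rng(0)
num_patches, num_tokens, dim, num_answers = 49, 8, 32, 5
visual = rng.normal(size=(num_patches, dim))
textual = rng.normal(size=(num_tokens, dim))
w_img = 0.1 * rng.normal(size=(dim, num_answers))
w_txt = 0.1 * rng.normal(size=(dim, num_answers))

probs = answer_probabilities(visual, textual, w_img, w_txt)
predicted_answer = int(np.argmax(probs))
```

Because each classifier outputs a probability distribution over the candidate answers, their average is also a valid distribution, and the final answer is simply its most probable entry.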

