A Deep Semantic Alignment Network for Cross-Modal Image-Text Retrieval in Remote Sensing

2021 
Because of the rapid growth of multi-modal data from the internet and social media, cross-modal retrieval has become an important and valuable task in recent years. The purpose of cross-modal retrieval is to obtain results in one modality (e.g., image) that are semantically similar to the query in another modality (e.g., text). In the field of remote sensing, despite a great number of existing works on image retrieval, there has been only a small amount of research on cross-modal image-text retrieval, due to the scarcity of datasets and the complicated characteristics of remote sensing image data. In this article, we introduce a novel cross-modal image-text retrieval network to establish the direct relationship between remote sensing images and their paired text data. Specifically, in our framework, we design a semantic alignment module (SAM) to fully explore the latent correspondence between images and text, in which attention and gate mechanisms are used to filter and optimize the data features so that more discriminative feature representations can be obtained. Experimental results on four benchmark remote sensing datasets, including UCMerced_LandUse-Captions, Sydney-Captions, RSICD, and NWPU-RESISC45-Captions, show that our proposed method outperforms other baselines and achieves state-of-the-art performance on remote sensing image-text retrieval tasks.
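To make the idea of a gated semantic alignment module concrete, the following is a minimal, illustrative sketch (not the authors' released code): it assumes pre-extracted image and text feature vectors of hypothetical dimensions `img_dim` and `txt_dim`, applies a sigmoid attention gate to filter each modality's features, and projects both into a shared embedding space whose cosine similarities can be used for retrieval. The specific layer sizes and gating form are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Sigmoid gate that re-weights feature channels (illustrative filtering step)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating suppresses less informative feature channels.
        return x * self.gate(x)


class SemanticAlignmentSketch(nn.Module):
    """Hypothetical alignment block: gate each modality, then embed into a shared space."""

    def __init__(self, img_dim: int = 2048, txt_dim: int = 1024, embed_dim: int = 512):
        super().__init__()
        self.img_gate = AttentionGate(img_dim)
        self.txt_gate = AttentionGate(txt_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # L2-normalized embeddings so the dot product is a cosine similarity.
        img_emb = F.normalize(self.img_proj(self.img_gate(img_feat)), dim=-1)
        txt_emb = F.normalize(self.txt_proj(self.txt_gate(txt_feat)), dim=-1)
        # Similarity matrix: rows index images, columns index captions.
        return img_emb @ txt_emb.t()


# Usage example with random features standing in for CNN/text-encoder outputs.
if __name__ == "__main__":
    model = SemanticAlignmentSketch()
    sims = model(torch.randn(8, 2048), torch.randn(8, 1024))
    print(sims.shape)  # torch.Size([8, 8])
```

In a retrieval setting, such a similarity matrix would typically be trained with a ranking or triplet loss so that matched image-text pairs score higher than mismatched ones.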