Global Relation-Aware Attention Network for Image-Text Retrieval

2021 
Cross-modal image-text retrieval has attracted extensive attention in recent years and contributes to the development of search engines. Fine-grained features and cross-attention have been widely used in previous research to achieve cross-modal image-text matching. Although cross-attention-based methods have achieved remarkable results, the interaction between the two modalities requires the features to be re-encoded during the evaluation phase, which is unsuitable for practical search-engine scenarios. In addition, the aggregated feature does not contain sufficient semantics when it is obtained by simple mean pooling. Furthermore, the connecting weights of self-attention blocks are invariant to the target position, which lacks the expected adaptability. To tackle these limitations, in this paper we propose a novel Global Relation-aware Attention Network (GRAN) for image-text retrieval, built on a Global Attention Module (GAM) and a Relation-aware Attention Module (RAM) that model the global feature and the relationships among local fragments. Firstly, we apply the Global Attention Module (GAM) on top of the fine-grained features to obtain a meaningful global feature. Secondly, we use several stacked transformer encoders to further encode the features of each modality separately. Finally, we propose the Relation-aware Attention Module (RAM), which generates a relation vector used to infer the attention intensity between pairwise fragments. The local features, the global feature, and their relations are considered jointly to conduct efficient image-text retrieval. Extensive experiments on the benchmark Flickr30K and MSCOCO datasets demonstrate the superiority of our method. On Flickr30K, compared to the state-of-the-art method TERAN, we improve the Recall@K (K=1) metric by 5.8% and 4.0 on the image and text retrieval tasks, respectively.
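As a rough illustration of the two modules described above, the following minimal PyTorch sketch shows one plausible reading of the Global Attention Module (attention-pooling the fine-grained fragment features into a single global feature) and the Relation-aware Attention Module (deriving a relation vector for each fragment pair and using it to set the attention intensity between them). All class names, dimensions, and internal details here are assumptions made for illustration only; the paper's actual architecture may differ.

```python
# Minimal sketch of the two modules named in the abstract (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAttentionModule(nn.Module):
    """Attention-pool fine-grained fragment features into one global feature
    (an assumed reading of the GAM; the paper's exact formulation may differ)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-fragment importance score

    def forward(self, fragments: torch.Tensor) -> torch.Tensor:
        # fragments: (batch, n_fragments, dim), e.g. region or word features
        weights = F.softmax(self.score(fragments), dim=1)   # (batch, n, 1)
        return (weights * fragments).sum(dim=1)             # (batch, dim)


class RelationAwareAttentionModule(nn.Module):
    """Produce a relation vector for each fragment pair and use it to infer
    the attention intensity between the pair (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.relation = nn.Linear(2 * dim, dim)  # pairwise relation vector
        self.intensity = nn.Linear(dim, 1)       # scalar attention intensity

    def forward(self, fragments: torch.Tensor) -> torch.Tensor:
        b, n, d = fragments.shape
        q = fragments.unsqueeze(2).expand(b, n, n, d)  # target fragment i
        k = fragments.unsqueeze(1).expand(b, n, n, d)  # source fragment j
        rel = torch.tanh(self.relation(torch.cat([q, k], dim=-1)))  # (b, n, n, d)
        attn = F.softmax(self.intensity(rel).squeeze(-1), dim=-1)   # (b, n, n)
        return attn @ fragments                                     # (b, n, d)


# Usage: pool 36 region features of dimension 1024 into one global vector
# and refine the fragments with relation-aware attention (shapes assumed).
regions = torch.randn(2, 36, 1024)
global_feat = GlobalAttentionModule(1024)(regions)      # (2, 1024)
refined = RelationAwareAttentionModule(1024)(regions)   # (2, 36, 1024)
```

In this reading, the pooled global feature replaces simple mean pooling, and the pairwise relation vector makes the attention weights depend on the target position, which is the adaptability the abstract argues plain self-attention lacks.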