Global Relation-Aware Attention Network for Image-Text Retrieval

2021 
Cross-modal image-text retrieval has attracted extensive attention in recent years and contributes to the development of search engines. Fine-grained features and cross-attention have been widely used in previous research to achieve cross-modal image-text matching. Although cross-attention-based methods have achieved remarkable results, the interaction between the two modalities requires the features to be re-encoded during the evaluation phase, which is unsuitable for practical search-engine scenarios. In addition, the aggregated feature does not contain sufficient semantics when it is obtained by simple mean pooling. Furthermore, the connecting weights of self-attention blocks are invariant to the target position, which lacks the expected adaptability. To tackle these limitations, in this paper we propose a novel Global Relation-aware Attention Network (GRAN) for image-text retrieval, built on a Global Attention Module (GAM) and a Relation-aware Attention Module (RAM) that model the global feature and the relationships among local fragments. Firstly, we apply the Global Attention Module (GAM) on top of the fine-grained features to obtain a meaningful global feature. Secondly, we use several stacked transformer encoders to further encode the features of each modality separately. Finally, we propose the Relation-aware Attention Module (RAM), which generates a relation vector used to infer the attention intensity between pairwise fragments. The local features, the global feature, and their relations are considered jointly to conduct efficient image-text retrieval. Extensive experiments on the benchmark Flickr30K and MSCOCO datasets demonstrate the superiority of our method. On Flickr30K, compared to the state-of-the-art method TERAN, we improve the Recall@K (K=1) metric by 5.8% and 4.0 on the image and text retrieval tasks, respectively.
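As a rough illustration of the two modules described above, the following minimal PyTorch sketch shows one plausible reading of the Global Attention Module (attention-pooling the fine-grained fragment features into a single global feature) and the Relation-aware Attention Module (deriving a relation vector for each fragment pair and using it to set the attention intensity between them). All class names, dimensions, and internal details here are assumptions made for illustration only; the paper's actual architecture may differ.

```python
# Minimal sketch of the two modules named in the abstract (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAttentionModule(nn.Module):
    """Attention-pool fine-grained fragment features into one global feature
    (an assumed reading of the GAM; the paper's exact formulation may differ)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-fragment importance score

    def forward(self, fragments: torch.Tensor) -> torch.Tensor:
        # fragments: (batch, n_fragments, dim), e.g. region or word features
        weights = F.softmax(self.score(fragments), dim=1)   # (batch, n, 1)
        return (weights * fragments).sum(dim=1)             # (batch, dim)


class RelationAwareAttentionModule(nn.Module):
    """Produce a relation vector for each fragment pair and use it to infer
    the attention intensity between the pair (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.relation = nn.Linear(2 * dim, dim)  # pairwise relation vector
        self.intensity = nn.Linear(dim, 1)       # scalar attention intensity

    def forward(self, fragments: torch.Tensor) -> torch.Tensor:
        b, n, d = fragments.shape
        q = fragments.unsqueeze(2).expand(b, n, n, d)  # target fragment i
        k = fragments.unsqueeze(1).expand(b, n, n, d)  # source fragment j
        rel = torch.tanh(self.relation(torch.cat([q, k], dim=-1)))  # (b, n, n, d)
        attn = F.softmax(self.intensity(rel).squeeze(-1), dim=-1)   # (b, n, n)
        return attn @ fragments                                     # (b, n, d)


# Usage: pool 36 region features of dimension 1024 into one global vector
# and refine the fragments with relation-aware attention (shapes assumed).
regions = torch.randn(2, 36, 1024)
global_feat = GlobalAttentionModule(1024)(regions)      # (2, 1024)
refined = RelationAwareAttentionModule(1024)(regions)   # (2, 36, 1024)
```

In this reading, the pooled global feature replaces simple mean pooling, and the pairwise relation vector makes the attention weights depend on the target position, which is the adaptability the abstract argues plain self-attention lacks.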