Auxiliary Bi-Level Graph Representation for Cross-Modal Image-Text Retrieval

2021 
Image-text retrieval is one of the most common tasks in multimodal retrieval. It suffers from an information imbalance between modalities, the so-called modality gap, and it remains challenging because prior methods cannot bridge this gap adequately. With the help of scene graphs, we design an auxiliary bi-level graph representation (ABGR) pipeline that fully mines the potential information in each modality while reducing information redundancy. Each modality is thereby represented by a lexical word graph that carries the main content of its information. Specifically, we design a graph feature enhancement (GFE) module to embed the graph-structured information in a common subspace while exploring the relationships between lexical words. As a result, a better representation of both image and text is obtained, which allows the similarity between images and texts to be evaluated more reasonably. Experimental results on two benchmark datasets, Flickr30K and MS-COCO, demonstrate the effectiveness of our proposed model for the cross-modal retrieval task.
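
The abstract does not specify how the GFE module is implemented. As a rough illustration only, the sketch below assumes a graph-attention style layer: lexical word nodes are projected into a common subspace, attention-weighted message passing explores relationships between neighboring words, and mean pooling yields a graph-level embedding whose cosine similarity scores image-text pairs. All names, dimensions, and design choices here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphFeatureEnhancement(nn.Module):
    """Hypothetical GFE sketch: one round of attention-based message
    passing over a lexical word graph, followed by mean pooling into
    a shared embedding space. Not the paper's actual architecture."""

    def __init__(self, in_dim=300, embed_dim=1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)   # node features -> common subspace
        self.attn = nn.Linear(2 * embed_dim, 1)    # scores each edge (i, j)

    def forward(self, nodes, adj):
        # nodes: (N, in_dim) lexical word features; adj: (N, N) 0/1 adjacency
        # (each node is assumed to have at least one neighbor)
        h = self.proj(nodes)                                    # (N, D)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.attn(pairs).squeeze(-1)                   # (N, N) edge scores
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)                   # attention over neighbors
        h = F.relu(alpha @ h)                                   # aggregate neighbor info
        return F.normalize(h.mean(dim=0), dim=-1)               # graph-level embedding

# Retrieval score: cosine similarity between the pooled embeddings of an
# image-side and a text-side lexical word graph (toy fully connected graphs).
gfe = GraphFeatureEnhancement()
img_nodes, img_adj = torch.randn(5, 300), torch.ones(5, 5)
txt_nodes, txt_adj = torch.randn(7, 300), torch.ones(7, 7)
similarity = gfe(img_nodes, img_adj) @ gfe(txt_nodes, txt_adj)
```

Because both modalities are reduced to lexical word graphs before embedding, a single shared module of this kind could in principle score image-text pairs symmetrically, which is consistent with the abstract's claim that the common subspace makes the similarity comparison more reasonable.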