Relational Attention with Textual Enhanced Transformer for Image Captioning

2021 
Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on information about individual target objects without exploring the relationships between objects and their surroundings, which greatly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels into the captioning model, including low-level box proposals and high-level region features. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended into attention vectors that guide caption generation. However, these attention vectors contain only image-level information and ignore textual information, which limits the model's capability in both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) that injects textual information into the Transformer. TET consists of two modules: a text-guided Transformer and a self-attention Transformer. The two modules perform semantic and visual attention, respectively, to guide the decoder to generate high-quality captions. We extensively evaluate our model on the MS COCO dataset; it achieves a 128.7 CIDEr-D score on the Karpathy split and a 126.3 CIDEr-D (c40) score on the official online evaluation server.
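The sketch below illustrates one plausible reading of the two TET modules described above: a text-guided stream in which image region features cross-attend to caption token embeddings (semantic attention), and a self-attention stream over region features alone (visual attention), with the two streams fused before decoding. All module names, dimensions, and the additive fusion are illustrative assumptions, not the authors' implementation.

    # Minimal PyTorch sketch of the two attention streams (assumed design).
    import torch
    import torch.nn as nn

    class TextGuidedLayer(nn.Module):
        """Semantic stream: image regions (queries) attend to caption tokens."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, regions, tokens):
            out, _ = self.attn(query=regions, key=tokens, value=tokens)
            return self.norm(regions + out)  # residual connection + layer norm

    class SelfAttentionLayer(nn.Module):
        """Visual stream: standard self-attention over region features."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, regions):
            out, _ = self.attn(regions, regions, regions)
            return self.norm(regions + out)

    class TETEncoderSketch(nn.Module):
        """Fuses the semantic (text-guided) and visual (self-attention) streams."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.text_guided = TextGuidedLayer(d_model, n_heads)
            self.self_attn = SelfAttentionLayer(d_model, n_heads)

        def forward(self, regions, tokens):
            semantic = self.text_guided(regions, tokens)  # text-enhanced features
            visual = self.self_attn(regions)              # purely visual features
            return semantic + visual                      # additive fusion (assumed)

    if __name__ == "__main__":
        regions = torch.randn(2, 36, 512)  # e.g. 36 detected region features per image
        tokens = torch.randn(2, 12, 512)   # embedded caption tokens
        print(TETEncoderSketch()(regions, tokens).shape)  # torch.Size([2, 36, 512])

The fused output would then feed the caption decoder in place of purely visual attention vectors; how the paper actually combines the two streams (e.g., gating or concatenation instead of summation) is not specified in the abstract.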