Show, Rethink, And Tell: Image Caption Generation With Hierarchical Topic Cues

2021 
Current state-of-the-art approaches for image captioning mainly apply the encoder-decoder framework with attention mechanisms, most of which ignore interactions between different types of image features and perform attention operations only once per word. These problems limit the captioning model's capability to capture sufficient information to generate high-quality captions. By contrast, humans often rethink and polish their descriptions by re-focusing on more accurate and important information that is hard to capture at first glance. In this paper, we introduce a novel topic-guided captioning model that imitates this human rethinking process by modeling interactions between visual features and hierarchical semantic features of topics. To the best of our knowledge, we are the first to effectively use hierarchical semantic features as guidance for visual attention, achieving human-like rethinking for captioning. Extensive experiments on the MS COCO dataset show that our proposed model achieves superior performance over state-of-the-art methods.
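The two-pass "rethink" attention described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation: `attend`, the additive query update, and all dimensions are hypothetical stand-ins for a first visual-attention pass followed by a second pass whose query is refined by a topic cue.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over attention scores
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    # dot-product attention: weight each image region by similarity to the query
    scores = features @ query
    weights = softmax(scores)
    return weights @ features, weights

# Hypothetical shapes: 5 image regions, feature dimension 8
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))   # visual region features from the encoder
hidden = rng.normal(size=8)         # decoder hidden state at the current word
topic = rng.normal(size=8)          # hierarchical topic embedding (semantic cue)

# Pass 1: standard visual attention driven by the decoder state alone
ctx1, w1 = attend(hidden, regions)

# Pass 2 ("rethink"): re-attend with the topic cue folded into the query,
# letting semantic topic features re-focus the visual attention weights
ctx2, w2 = attend(hidden + topic, regions)

print(ctx2.shape)  # refined context vector fed to the word predictor
```

In a trained model the second-pass query would come from learned projections of the topic features rather than a plain sum; the point here is only that attention is applied more than once per word, with the rethink pass conditioned on topic guidance.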