Local-to-Global Semantic Supervised Learning for Image Captioning

2020 
Image captioning is a challenging problem owing to the complexity of image content and the diverse ways of describing that content in natural language. Although current methods have made substantial progress on objective metrics (such as BLEU, METEOR, ROUGE-L, and CIDEr), several problems remain. Specifically, most of these methods are trained to maximize the log-likelihood or the objective metrics, and as a result they often generate rigid and semantically incomplete captions. In this paper, we develop a new model that aims to generate captions conforming to human evaluation. The core idea is local-to-global semantic supervised learning, realized through two levels of optimization objectives. At the word level, we match each word to image regions using a local attention objective function; at the sentence level, we align the entire sentence with the image using a global semantic objective function. Experimentally, we compare the proposed model with current methods on the MSCOCO dataset. Through ablation studies, we show that both local attention supervision and global semantic supervision are necessary components for the success of our model. Furthermore, combining the two supervision objectives achieves state-of-the-art performance in terms of both standard evaluation metrics and human judgment.
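The abstract does not give the exact form of the two objectives, but the overall scheme can be sketched: a word-level term that pulls each word embedding toward the image regions it attends to, plus a sentence-level term that aligns the whole-sentence embedding with a global image embedding. The sketch below is a minimal NumPy illustration under assumed forms (mean-squared error for the local term, one minus cosine similarity for the global term, and a weighting factor `lam`); all function names and the specific loss forms are hypothetical, not the paper's actual objectives.

```python
import numpy as np

def local_attention_loss(word_embs, region_embs, attn):
    """Word-level term (assumed MSE form): each word embedding should
    match the attention-weighted combination of image region features.
    word_embs: (T, D), region_embs: (R, D), attn: (T, R) rows sum to 1."""
    attended = attn @ region_embs            # (T, D) attended region feature per word
    return float(np.mean((word_embs - attended) ** 2))

def global_semantic_loss(sent_emb, img_emb):
    """Sentence-level term (assumed form): 1 - cosine similarity
    between the sentence embedding and the global image embedding."""
    cos = sent_emb @ img_emb / (np.linalg.norm(sent_emb) * np.linalg.norm(img_emb))
    return float(1.0 - cos)

def combined_loss(word_embs, region_embs, attn, sent_emb, img_emb, lam=0.5):
    """Local-to-global supervision: weighted sum of the two terms.
    `lam` is a hypothetical balancing hyperparameter."""
    return (local_attention_loss(word_embs, region_embs, attn)
            + lam * global_semantic_loss(sent_emb, img_emb))
```

When the sentence embedding is perfectly aligned with the image embedding, the global term vanishes, so the remaining gradient comes entirely from the word-region matching; this is the sense in which the two levels of supervision are complementary.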