Local-to-Global Semantic Supervised Learning for Image Captioning

2020 
Image captioning is a challenging problem owing to the complexity of image content and the diverse ways of describing that content in natural language. Although current methods have made substantial progress on objective metrics (such as BLEU, METEOR, ROUGE-L, and CIDEr), several problems remain. Specifically, most of these methods are trained to maximize the log-likelihood or the objective metrics, and as a result they often generate rigid and semantically incomplete captions. In this paper, we develop a new model that aims to generate captions conforming to human evaluation. The core idea is local-to-global semantic supervised learning, realized through two levels of optimization objectives. At the word level, we match each word to image regions using a local attention objective function; at the sentence level, we align the entire sentence with the image using a global semantic objective function. Experimentally, we compare the proposed model with current methods on the MSCOCO dataset. Through ablation studies, we show that both local attention supervision and global semantic supervision are necessary components for the success of our model. Furthermore, combining the two supervision objectives achieves state-of-the-art performance in terms of both standard evaluation metrics and human judgment.
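The abstract does not give the exact form of the two objectives, but the overall scheme can be sketched: a word-level term that pulls each word embedding toward the image regions it attends to, plus a sentence-level term that aligns the whole-sentence embedding with a global image embedding. The sketch below is a minimal NumPy illustration under assumed forms (mean-squared error for the local term, one minus cosine similarity for the global term, and a weighting factor `lam`); all function names and the specific loss forms are hypothetical, not the paper's actual objectives.

```python
import numpy as np

def local_attention_loss(word_embs, region_embs, attn):
    """Word-level term (assumed MSE form): each word embedding should
    match the attention-weighted combination of image region features.
    word_embs: (T, D), region_embs: (R, D), attn: (T, R) rows sum to 1."""
    attended = attn @ region_embs            # (T, D) attended region feature per word
    return float(np.mean((word_embs - attended) ** 2))

def global_semantic_loss(sent_emb, img_emb):
    """Sentence-level term (assumed form): 1 - cosine similarity
    between the sentence embedding and the global image embedding."""
    cos = sent_emb @ img_emb / (np.linalg.norm(sent_emb) * np.linalg.norm(img_emb))
    return float(1.0 - cos)

def combined_loss(word_embs, region_embs, attn, sent_emb, img_emb, lam=0.5):
    """Local-to-global supervision: weighted sum of the two terms.
    `lam` is a hypothetical balancing hyperparameter."""
    return (local_attention_loss(word_embs, region_embs, attn)
            + lam * global_semantic_loss(sent_emb, img_emb))
```

When the sentence embedding is perfectly aligned with the image embedding, the global term vanishes, so the remaining gradient comes entirely from the word-region matching; this is the sense in which the two levels of supervision are complementary.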