Unpaired Image Captioning with Semantic-Constrained Self-Learning

2021 
Image captioning is an emerging and fast-developing research topic that seeks to automatically describe an image with a natural-language sentence. Nevertheless, most existing works rely heavily on large amounts of paired image-sentence training data, which hinders the practical application of captioning in the wild. In this paper, we present a novel Semantic-Constrained Self-learning (SCS) framework for image captioning that explores an iterative self-learning strategy to learn an image captioner with only unpaired image and text data. Technically, SCS consists of two stages, i.e., pseudo pair generation and captioner re-training, which iteratively produce "pseudo" image-sentence pairs via a pre-trained captioner and re-train the captioner on those pseudo pairs, respectively. Both stages are guided by the objects recognized in the image, which act as a semantic constraint that strengthens the semantic alignment between the input image and the output sentence. For pseudo pair generation, we leverage a semantic-constrained beam search that regularizes the decoding process by forcing the inclusion of recognized objects and the exclusion of irrelevant objects in the output sentence. For captioner re-training, a self-supervised triplet loss is utilized to preserve the relative semantic similarity ordering among the sentences generated for an input image triplet. Moreover, an object inclusion reward and an adversarial reward are adopted during self-critical training to encourage the inclusion of the predicted objects in the output sentence and to pursue the generation of more realistic sentences, respectively. Experiments conducted on two types of unpaired data, i.e., dependent data (both images and captions from COCO) and independent data (images from Flickr30K and captions from COCO or Conceptual Captions), validate the superiority of our SCS. More remarkably, we obtain the best published CIDEr score to date of 74.7% on the COCO Karpathy test split for unpaired image captioning.
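To make the object inclusion reward concrete, the following is a minimal, hypothetical sketch assuming the reward is simply the fraction of recognized objects that appear as words in the generated caption; the paper's exact formulation (e.g., soft matching against object synonyms or a weighted variant) may differ.

```python
def object_inclusion_reward(caption: str, recognized_objects: list[str]) -> float:
    """Hypothetical sketch: fraction of recognized objects mentioned in the caption.

    A reward of this form can be combined with an adversarial reward during
    self-critical training to encourage the captioner to mention the objects
    detected in the input image.
    """
    tokens = set(caption.lower().split())
    if not recognized_objects:
        return 0.0
    hits = sum(1 for obj in recognized_objects if obj.lower() in tokens)
    return hits / len(recognized_objects)


# Example: two of the three recognized objects appear in the caption.
print(object_inclusion_reward("a dog chasing a frisbee in the park",
                              ["dog", "frisbee", "person"]))  # ~0.667
```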