Structural Semantic Adversarial Active Learning for Image Captioning

2020 
Most image captioning models achieve superior performance with the help of large-scale supervised training data, but it is prohibitively costly to label image captions. To solve this problem, we propose a structural semantic adversarial active learning (SSAAL) model that leverages both visual and textual information to select the most representative samples while maximizing image captioning performance. SSAAL consists of a semantic constructor, a snapshot & caption (SC) supervisor, and a labeled/unlabeled state discriminator. The constructor is designed to generate a structural semantic representation describing the objects, attributes, and object relationships in the image. The SC supervisor supervises this representation at the word level and sentence level in a multi-task learning manner, which directly relates the representation to ground-truth captions and updates it during caption generation. Finally, we introduce a state discriminator to predict each sample's state and select images with sufficient semantic and fine-grained diversity. Extensive experiments on a standard captioning dataset show that our model outperforms other active learning methods and achieves competitive performance even when selecting a small number of samples.
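The abstract does not give implementation details, but the core active learning step it describes (a state discriminator scoring unlabeled images so the least "labeled-like" ones are sent for annotation) can be sketched as follows. All names here (`select_for_labeling`, the toy centroid-based discriminator) are hypothetical illustrations, not the paper's actual method:

```python
import numpy as np

def select_for_labeling(unlabeled_feats, discriminator, budget):
    """Hypothetical selection step: rank unlabeled images by the state
    discriminator's predicted probability of being 'labeled-like', then
    pick the least labeled-like samples, i.e. those farthest from the
    current labeled pool and so most informative to annotate."""
    scores = np.array([discriminator(f) for f in unlabeled_feats])
    # Lowest 'labeled' probability = most novel with respect to the labeled set.
    return np.argsort(scores)[:budget]

# Toy stand-in for a trained discriminator (an assumption, not the paper's
# network): score decays with distance from a labeled-pool centroid.
labeled_centroid = np.array([1.0, 1.0])
discriminator = lambda f: 1.0 / (1.0 + np.linalg.norm(f - labeled_centroid))

pool = [np.array([1.1, 0.9]), np.array([5.0, 5.0]), np.array([0.9, 1.0])]
picked = select_for_labeling(pool, discriminator, budget=1)
# The outlier at (5, 5) is chosen for labeling.
```

In the paper's full pipeline this scoring would come from the adversarially trained labeled/unlabeled discriminator rather than a distance heuristic; the sketch only shows how such scores drive sample selection.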