Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning

2022 
Despite considerable progress, image captioning still suffers from a large gap in caption quality between easy and hard examples, a gap that existing methods leave unexploited. To address this issue, we explore hard example mining in image captioning and propose a simple yet effective mechanism that instructs the model to pay more attention to hard examples, thereby improving performance in both general and complex scenarios. We first propose a novel learning strategy, termed Metric-oriented Focal Mechanism (MFM), for hard example mining in image captioning. Unlike existing strategies designed for classification tasks, MFM uses the generative metrics of image captioning to measure the difficulty of each example, and then up-weights the rewards of hard examples during training. To make MFM applicable to different datasets without tedious parameter tuning, we further introduce an adaptive reward metric called Effective CIDEr (ECIDEr), which takes the data distribution of easy and hard examples into account during reward estimation. Extensive experiments on the MS COCO benchmark show that MFM significantly improves the quality of captions for hard examples while maintaining performance on simple examples. When applied to a current state-of-the-art method, DLCT (Luo et al., 2021), ECIDEr-based MFM outperforms all existing methods and achieves new state-of-the-art performance in both offline and online testing on MS COCO, i.e., 134.3 CIDEr in offline testing and 136.1 in online testing. To validate the generalization ability of ECIDEr-based MFM, we also apply it to another dataset, Flickr30k, where it also yields performance gains.
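To make the idea of metric-based reward up-weighting concrete, the following is a minimal illustrative sketch and not the formulation used in the paper, which the abstract does not specify. It assumes a self-critical style setup in which each sampled caption has a metric score and a baseline score, treats a low baseline score as a sign of a hard example, and scales the advantage by a focal-style weight. The function names, the weight form `1 + (1 - p) ** gamma`, and the parameters `max_score` and `gamma` are all hypothetical.

```python
from typing import List


def difficulty_weights(scores: List[float],
                       max_score: float = 10.0,
                       gamma: float = 2.0) -> List[float]:
    """Hypothetical focal-style weights driven by a captioning metric.

    A low metric score (hard example) gives a weight close to 2,
    a high score (easy example) gives a weight close to 1.
    This form is an assumption for illustration, not the paper's formula.
    """
    weights = []
    for s in scores:
        # Normalize the metric score into [0, 1]; max_score is an assumed bound.
        p = min(max(s / max_score, 0.0), 1.0)
        weights.append(1.0 + (1.0 - p) ** gamma)
    return weights


def reweighted_advantages(sample_scores: List[float],
                          baseline_scores: List[float],
                          max_score: float = 10.0,
                          gamma: float = 2.0) -> List[float]:
    """Scale the advantage (sample score minus baseline score) per example.

    Difficulty is estimated from the baseline score, which is an assumption.
    """
    weights = difficulty_weights(baseline_scores, max_score, gamma)
    return [w * (s - b)
            for w, s, b in zip(weights, sample_scores, baseline_scores)]


if __name__ == "__main__":
    baseline = [0.0, 5.0, 10.0]   # hard, medium, easy (illustrative values)
    sampled = [1.0, 6.0, 10.0]
    print(difficulty_weights(baseline))
    print(reweighted_advantages(sampled, baseline))
```

In this sketch the weight never drops below 1, so easy examples keep their original reward while hard examples receive a larger one. That mirrors the stated goal of improving hard examples without sacrificing performance on simple ones, though the actual MFM and ECIDEr definitions may differ.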