Revisiting Image Captioning via Maximum Discrepancy Competition

2021 
Abstract Image captioning, a research topic bridging computer vision and natural language processing, has attracted considerable attention over the past several decades and has achieved great progress with the help of large-scale datasets and deep learning techniques. Despite the variety of image captioning models (ICMs), their performance appears to have reached a bottleneck judging from the publicly published results. Considering the marginal performance gains brought by recent ICMs, we raise the following question: "What performance do recent ICMs achieve on in-the-wild images?" To answer this question, we compare existing ICMs by evaluating their generalization ability. Specifically, we propose a novel method based on maximum discrepancy competition to diagnose existing ICMs. First, we establish a new test set containing only informative images, selected from an arbitrary large-scale raw image set by applying maximum discrepancy competition to the existing ICMs. Second, a small-scale and low-cost subjective annotation experiment is conducted on the new test set. Third, we rank the generalization ability of the existing ICMs by comparing their performance on the new test set. Finally, the key components of different ICMs are analyzed in detail based on the experimental results. Our analysis yields several interesting findings: 1) using low- and high-level object features simultaneously may be an effective way to boost the generalization ability of Transformer-based ICMs; 2) the self-attention mechanism may model inter- and intra-modal data better than other attention-based mechanisms; and 3) constructing an ICM with a multistage language decoder may be a promising way to improve its performance.
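
The abstract does not specify the implementation, but the selection step can be sketched as follows. This is a minimal Python sketch under stated assumptions: the models are hypothetical captioner callables mapping an image path to a caption string, and a token-level Jaccard distance stands in for whatever caption-discrepancy measure the paper actually uses.

```python
# Minimal sketch of maximum discrepancy image selection (assumptions noted above).
# For every pair of captioning models, keep the k images on which their captions
# disagree the most; the union of these images forms the informative test set.

from itertools import combinations


def discrepancy(caption_a: str, caption_b: str) -> float:
    """1 - Jaccard similarity over word tokens: 0 = identical, 1 = disjoint.
    A stand-in for the paper's (unspecified here) caption-discrepancy measure."""
    tokens_a = set(caption_a.lower().split())
    tokens_b = set(caption_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mad_select(models, raw_images, k):
    """Select informative images via pairwise maximum discrepancy competition.

    models:     list of callables, each mapping an image path -> caption string
    raw_images: list of image paths drawn from an arbitrary large-scale set
    k:          number of top-discrepancy images kept per model pair
    """
    selected = set()
    for model_a, model_b in combinations(models, 2):
        scored = sorted(
            raw_images,
            key=lambda img: discrepancy(model_a(img), model_b(img)),
            reverse=True,
        )
        selected.update(scored[:k])
    return selected
```

Under this reading, the resulting image set is then passed to the small-scale subjective annotation experiment, and the ICMs are ranked by how their captions fare under human judgment on exactly these hard, disagreement-inducing images.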