Modeling Context-Guided Visual and Linguistic Semantic Feature for Video Captioning.

2021 
Exploiting temporal visual features and their corresponding descriptions has received increasing attention in video captioning. Most existing models generate captioning words based solely on the temporal structure of the video, ignoring fine-grained scene information. Moreover, the traditional long short-term memory (LSTM) network used as the decoder in recent models predicts the current word directly from the last generated hidden state, so the predicted word may depend heavily on that single state rather than on the overall context. To model the temporal structure of activities typically shown in videos and to better capture long-range context, we propose a novel video captioning framework built on a context-guided semantic features model (CSF). Specifically, to maximize information flow, several previous and future states are aggregated to guide the current token through a semantic loss in both the encoding and decoding phases, and the visual and linguistic representations are refined by fusing this surrounding information. Extensive experiments on the MSVD and MSR-VTT video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
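The sketch below is a minimal illustration (not the authors' released code) of the context-guided idea described in the abstract: for each time step, hidden states from a window of previous and future steps are aggregated and used to guide the current token through an auxiliary semantic loss. The window size, mean-pooling aggregation, linear fusion, and cosine-based loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextGuidedFusion(nn.Module):
    """Fuses each hidden state with its surrounding (past and future) states."""

    def __init__(self, hidden_dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, states: torch.Tensor):
        # states: (batch, time, hidden_dim) from an encoder or decoder LSTM
        B, T, H = states.shape
        contexts = []
        for t in range(T):
            lo, hi = max(0, t - self.window), min(T, t + self.window + 1)
            # Aggregate surrounding hidden states (excluding the current one)
            # by mean pooling over the window.
            idx = [i for i in range(lo, hi) if i != t]
            ctx = states[:, idx, :].mean(dim=1)
            contexts.append(ctx)
        context = torch.stack(contexts, dim=1)              # (B, T, H)
        fused = torch.tanh(self.fuse(torch.cat([states, context], dim=-1)))
        # Semantic loss: pull each state toward its aggregated context so the
        # prediction reflects broader context, not only the last hidden state.
        sem_loss = 1.0 - F.cosine_similarity(states, context, dim=-1).mean()
        return fused, sem_loss


if __name__ == "__main__":
    h = torch.randn(4, 10, 512)                 # toy LSTM hidden states
    fused, sem_loss = ContextGuidedFusion(512)(h)
    print(fused.shape, sem_loss.item())
```

In this reading, `sem_loss` would be added to the captioning cross-entropy loss as a regularizer in both the encoding and decoding phases; the actual aggregation and loss formulation in the paper may differ.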