Memory-attended semantic context-aware network for video captioning

2021 
Automatically generating video captions is an active and flourishing research topic that involves complex interactions between visual features and natural language generation. The attention mechanism obtains the key visual information corresponding to each word by removing redundant information. However, existing visual attention methods are only indirectly guided by the hidden state of the language model, ignoring the interactions between the visual features obtained by the attention mechanisms. Due to incomplete objects or interference noise, attention over frame features struggles to find the correct regions of interest that are closely related to the motion state. Worse still, at each time step the hidden states have no access to the posterior decoding states. The future predicted information is not fully utilized, which leads to a lack of detailed context-aware information. In this paper, we propose a novel video captioning framework, the Memory-attended Semantic Context-aware Network (MaSCN), which captures the adjacent sequential dependencies between different outputs across multiple time steps for visual features. To exploit pivotal features from coarse-grained to fine-grained, we introduce an attention module in MaSCN that uses correspondingly tailored Visual Semantic LSTM (VSLSTM) layers to map visual relationship information more precisely through a multi-level attention mechanism. In addition, we integrate the visual features obtained through the attention mechanism via late fusion. A visual semantic loss is used to explicitly memorize contextual information, capturing fine-grained detailed cues. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
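The abstract describes hidden-state-guided attention over visual features combined with a late fusion of the attended context vectors before word prediction. The following is a minimal PyTorch sketch of that general pattern only; the module names, dimensions, additive attention form, and fusion strategy are illustrative assumptions and do not reproduce the authors' MaSCN or VSLSTM implementation.

```python
# Sketch of hidden-state-guided temporal attention with late fusion of two
# feature streams, as described at a high level in the abstract.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """Scores each frame feature against the decoder hidden state and
    returns the weighted sum of the frame features."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, n_frames)
        return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)         # (batch, feat_dim)


class LateFusionDecoderStep(nn.Module):
    """One decoding step: attend separately to appearance and motion
    features, fuse the two context vectors late, then predict the next word."""

    def __init__(self, app_dim, mot_dim, hidden_dim, vocab_size, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.app_attn = AdditiveAttention(app_dim, hidden_dim)
        self.mot_attn = AdditiveAttention(mot_dim, hidden_dim)
        self.fuse = nn.Linear(app_dim + mot_dim, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, app_feats, mot_feats, state):
        h, c = state
        ctx_app = self.app_attn(app_feats, h)           # appearance context
        ctx_mot = self.mot_attn(mot_feats, h)           # motion context
        ctx = torch.relu(self.fuse(torch.cat([ctx_app, ctx_mot], dim=-1)))  # late fusion
        h, c = self.lstm(torch.cat([self.embed(word_ids), ctx], dim=-1), (h, c))
        return self.out(h), (h, c)


if __name__ == "__main__":
    batch, n_frames = 2, 26
    step = LateFusionDecoderStep(app_dim=2048, mot_dim=1024, hidden_dim=512, vocab_size=10000)
    app = torch.randn(batch, n_frames, 2048)
    mot = torch.randn(batch, n_frames, 1024)
    state = (torch.zeros(batch, 512), torch.zeros(batch, 512))
    logits, state = step(torch.tensor([1, 1]), app, mot, state)
    print(logits.shape)  # torch.Size([2, 10000])
```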