Dense Video Captioning with Hierarchical Attention-Based Encoder-Decoder Networks

2021 
Dense video captioning is a challenging task that aims to localize and describe all events in an untrimmed video, taking both visual and textual information into account. Although existing methods have made some progress, most of them suffer from missing details and inferior captions. Recent work has used object features to supply more detailed information; however, because videos contain a large number of objects, the learned object representations are often noisy and may interfere with generating correct captions. We also observe that real-world video-text data involve different levels of granularity, such as objects/words and events/sentences. We therefore propose hierarchical video-text attention-based encoder-decoder networks for dense video captioning. The proposed method accounts for the hierarchy in both video and text and exploits the most relevant visual and textual features when generating captions. Specifically, we design a hierarchical attention encoder for learning complex visual information: an object attention module that focuses on the most relevant objects and an event attention module that models long-range temporal context. A corresponding decoder translates the multi-level features into a linguistic description, i.e., a word attention module that exploits the most correlated textual features and a sentence attention module that leverages high-level semantic information. The proposed hierarchical attention mechanism achieves state-of-the-art performance on the ActivityNet Captions dataset.
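To make the two-level encoder and two-level decoder described in the abstract more concrete, the following is a minimal, hypothetical sketch in PyTorch. The module names, feature dimensions, pooling choices, and attention wiring are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a hierarchical attention encoder-decoder, assuming
# standard multi-head attention for each of the four modules named in the
# abstract (object, event, word, sentence). All shapes are hypothetical.
import torch
import torch.nn as nn


class HierarchicalAttentionEncoder(nn.Module):
    """Object-level attention followed by event-level attention (assumed design)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.object_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.event_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, object_feats, event_feats):
        # object_feats: (batch, num_objects, dim) - detected object features
        # event_feats:  (batch, num_events, dim)  - event/segment-level features
        # Events query the objects, selecting the most relevant ones.
        obj_ctx, _ = self.object_attn(event_feats, object_feats, object_feats)
        # Self-attention over events models long-range temporal context.
        evt_ctx, _ = self.event_attn(obj_ctx, obj_ctx, obj_ctx)
        return evt_ctx


class HierarchicalAttentionDecoder(nn.Module):
    """Word-level and sentence-level attention over the visual context (assumed design)."""

    def __init__(self, dim: int = 512, heads: int = 8, vocab: int = 10000):
        super().__init__()
        self.word_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, word_embeds, visual_ctx):
        # word_embeds: (batch, seq_len, dim) - embeddings of the caption tokens
        # Word attention exploits the most correlated visual features per token.
        w_ctx, _ = self.word_attn(word_embeds, visual_ctx, visual_ctx)
        # Sentence attention aggregates higher-level semantic context.
        s_ctx, _ = self.sent_attn(w_ctx, w_ctx, w_ctx)
        return self.out(s_ctx)  # per-step vocabulary logits


if __name__ == "__main__":
    enc = HierarchicalAttentionEncoder()
    dec = HierarchicalAttentionDecoder()
    objects = torch.randn(2, 20, 512)  # 20 object features per sample
    events = torch.randn(2, 5, 512)    # 5 event segments per sample
    words = torch.randn(2, 12, 512)    # 12 caption tokens so far
    logits = dec(words, enc(objects, events))
    print(logits.shape)                # torch.Size([2, 12, 10000])
```

In this sketch the encoder lets event-level queries attend over object features (objects/events hierarchy), and the decoder stacks word-level cross-attention with sentence-level self-attention (words/sentences hierarchy), mirroring the granularity levels the paper describes.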