DCT-net: A deep co-interactive transformer network for video temporal grounding

2021 
Abstract Language-guided video temporal grounding aims to temporally localize the best-matched video segment in an untrimmed long video according to a given natural language query (sentence). The main challenge in this task lies in how to fuse visual and linguistic information effectively. Recent works have shown that the attention mechanism is beneficial to the multi-modal feature fusion process. In this paper, we present a concise yet effective Deep Co-Interactive Transformer Network (DCT-Net), which repurposes a Transformer-style architecture to sufficiently model cross-modality interactions. It consists of Co-Interactive Transformer (CIT) layers cascaded in depth for multi-step interactions between a video-sentence pair. With the help of the proposed CIT layer, the visual and language features mutually improve each other. Extensive experiments on two public datasets, i.e., ActivityNet-Caption and TACOS, demonstrate the effectiveness of our proposed model compared to state-of-the-art methods.
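To make the idea of a co-interactive layer concrete, below is a minimal PyTorch sketch of one such layer and of stacking several layers in depth. It uses symmetric cross-attention (video queries attend over word features, and word queries attend over clip features), followed by residual connections, layer normalization, and feed-forward blocks. The hidden size, number of heads, number of layers, and the exact composition of each block are illustrative assumptions; the abstract does not specify the authors' actual design, and this is not their implementation.

```python
import torch
import torch.nn as nn

class CoInteractiveTransformerLayer(nn.Module):
    """Sketch of a co-interactive layer: each modality attends to the other
    via cross-attention, then passes through a feed-forward block.
    Dimensions and block composition are assumptions, not the paper's exact design."""

    def __init__(self, dim=512, heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        # Cross-attention in both directions: video -> text and text -> video.
        self.vid_to_txt = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.txt_to_vid = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn_v = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.ffn_t = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))

    def forward(self, video, text):
        # video: (B, Nv, dim) clip features; text: (B, Nt, dim) word features.
        v_attn, _ = self.vid_to_txt(query=video, key=text, value=text)
        t_attn, _ = self.txt_to_vid(query=text, key=video, value=video)
        video = self.norm_v1(video + v_attn)
        text = self.norm_t1(text + t_attn)
        video = self.norm_v2(video + self.ffn_v(video))
        text = self.norm_t2(text + self.ffn_t(text))
        return video, text


# Cascading several layers in depth yields multi-step video-sentence interaction,
# as described in the abstract; three layers here is an arbitrary choice.
layers = nn.ModuleList([CoInteractiveTransformerLayer() for _ in range(3)])
video = torch.randn(2, 64, 512)   # batch of 2 videos, 64 clip features each
text = torch.randn(2, 12, 512)    # batch of 2 queries, 12 word features each
for layer in layers:
    video, text = layer(video, text)
```

In such a design, each modality's representation is refined conditioned on the other at every layer, so the visual and language features can benefit from each other over multiple interaction steps; the grounded segment would then be predicted from the fused features by a separate localization head, which is not sketched here.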