Cascade Cross-modal Attention Network for Video Actor and Action Segmentation from a Sentence

2021 
In this paper, we address the problem of selectively segmenting an actor and its action in a video clip given a sentence description. The main challenge is to match the local semantic features of the video with heterogeneous textual features. Previous works commonly process the language with a bi-LSTM and self-attention, which fixes the sentence attention independently of the particular video; the sentence attention therefore mismatches the most discriminative features of the video. The proposed algorithm instead allows the sentence attention to adapt to the most discriminative features of each video, markedly improving matching and segmentation accuracy. Specifically, we propose a cascade cross-modal attention that leverages visual features from two perspectives to attend to the language from coarse to fine, generating discriminative vision-aware language features. Moreover, equipping our framework with a contrastive learning objective and a designed hard negative mining strategy helps the network identify the positive sample among many negatives, further improving performance. To demonstrate the effectiveness of our approach, we conduct experiments on two datasets, A2D Sentences and J-HMDB Sentences. Experimental results show that our method significantly outperforms recent state-of-the-art methods.
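To make the cascade cross-modal attention idea concrete, below is a minimal PyTorch sketch of a two-stage, coarse-to-fine module in which visual features attend to word features and re-weight them into vision-aware language features. The module names, feature dimensions, and the exact cascade wiring are assumptions for illustration; the abstract does not specify the paper's actual architecture.

```python
# A sketch of cascade cross-modal attention: coarse (clip-level) and then
# fine (pixel-level) visual features attend to the sentence, producing
# vision-aware language features. All details here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Visual tokens attend over word features and re-weight the sentence."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects visual features
        self.key = nn.Linear(dim, dim)    # projects word features
        self.scale = dim ** -0.5

    def forward(self, visual: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D) visual tokens; words: (B, Nw, D) word features
        attn = torch.einsum('bvd,bwd->bvw', self.query(visual), self.key(words))
        attn = F.softmax(attn * self.scale, dim=-1)       # (B, Nv, Nw)
        # Average the attention each word receives across visual tokens,
        # then use it to re-weight the original word features.
        word_weights = attn.mean(dim=1)                   # (B, Nw)
        return words * word_weights.unsqueeze(-1)         # vision-aware words


class CascadeCrossModalAttention(nn.Module):
    """Stage 1: coarse clip descriptor attends language.
    Stage 2: fine spatial tokens refine the re-weighted language."""

    def __init__(self, dim: int):
        super().__init__()
        self.coarse = CrossModalAttention(dim)
        self.fine = CrossModalAttention(dim)

    def forward(self, clip_feat, pixel_feat, words):
        # clip_feat: (B, 1, D) global clip feature (coarse perspective)
        # pixel_feat: (B, HW, D) spatial feature tokens (fine perspective)
        words = self.coarse(clip_feat, words)   # coarse re-weighting
        words = self.fine(pixel_feat, words)    # fine re-weighting
        return words                            # discriminative language features
```

The abstract also mentions a contrastive learning method with a designed hard negative mining strategy. One common way to realize this, sketched below under the assumption of an InfoNCE-style loss over video-text pairs, is to score all in-batch negatives and keep only the top-k most confusing ones; the paper's actual loss and mining strategy may differ.

```python
# Illustrative contrastive loss with simple hard negative mining:
# keep only the k highest-similarity (most confusing) negatives per row.
import torch
import torch.nn.functional as F


def contrastive_loss(video_emb, text_emb, k_hard=16, temperature=0.07):
    # video_emb, text_emb: (B, D); row i of each is a matched pair.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t() / temperature                       # (B, B) similarities
    pos = sim.diag()                                    # positive pair scores
    mask = torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))          # exclude positives
    # Hard negative mining: retain only the k most confusing negatives.
    hard_neg, _ = neg.topk(min(k_hard, neg.size(1) - 1), dim=1)
    logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1)
    # The positive sits at index 0 of every row.
    labels = torch.zeros(len(sim), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)
```

Restricting the softmax denominator to the hardest negatives concentrates the gradient on the samples the model currently confuses with the positive, which is the intuition behind hard negative mining in contrastive objectives.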