AsyNCE: Disentangling False-Positives for Weakly-Supervised Video Grounding

2021 
Weakly-supervised video grounding aims to ground textual phrases in video content with only video-sentence pairs provided during training, since bounding-box annotations are prohibitively costly to collect. Existing methods cast this task as a frame-level multiple instance learning (MIL) problem trained with a ranking loss. However, an object may appear in only a sparse subset of frames, yielding uncertain false-positive frames, so directly averaging the loss over all frames is inadequate in the video domain. Moreover, positive and negative pairs are tightly coupled in the ranking loss, making it impossible to handle false-positive frames individually. Additionally, a naive inner product is a suboptimal similarity measure across modalities. To address these issues, we propose a novel AsyNCE loss that flexibly disentangles positive pairs from negative ones in frame-level MIL, which effectively mitigates the uncertainty of false-positive frames. In addition, a cross-modal transformer block is introduced to purify the text feature with image-frame context, generating a visually guided text feature for a better similarity measure. Extensive experiments on the YouCook2, RoboWatch and WAB datasets demonstrate the superiority and robustness of our method over state-of-the-art methods.
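The paper's exact formulation is not reproduced here, but the core idea, decoupling the positive term from the negative term so each candidate positive frame can be weighted on its own, can be sketched as below. This is a minimal, hypothetical PyTorch sketch assuming an InfoNCE-style contrastive objective; the function name `asynce_style_loss`, the temperature `tau`, and the per-frame weights `pos_weights` are illustrative assumptions, not the authors' definitions.

```python
import torch

def asynce_style_loss(pos_sim, neg_sim, tau=0.07, pos_weights=None):
    """Hypothetical decoupled NCE-style loss over frame-level MIL candidates.

    pos_sim:     (B, P) similarities of candidate positive frame-phrase pairs.
    neg_sim:     (B, N) similarities of negative frame-phrase pairs.
    pos_weights: optional (B, P) weights; down-weighting an entry suppresses
                 an uncertain false-positive frame (None = plain average).
    """
    if pos_weights is None:
        pos_weights = torch.ones_like(pos_sim)
    # Shared negative term: log-sum-exp over all negatives.
    neg_term = torch.logsumexp(neg_sim / tau, dim=1, keepdim=True)       # (B, 1)
    # Per-positive log-probability; each positive enters the loss on its
    # own, so a single false positive can be masked without the rest.
    log_prob = pos_sim / tau - torch.logaddexp(pos_sim / tau, neg_term)  # (B, P)
    loss = -(pos_weights * log_prob).sum(1) / pos_weights.sum(1).clamp_min(1e-6)
    return loss.mean()

# Toy usage: 2 clips, 3 candidate positive frames, 5 negatives each;
# the third frame of each clip is treated as an uncertain false positive.
pos = torch.randn(2, 3)
neg = torch.randn(2, 5)
w = torch.tensor([[1.0, 1.0, 0.2], [1.0, 1.0, 0.2]])
print(asynce_style_loss(pos, neg, pos_weights=w))
```

Unlike a margin ranking loss, where each positive is scored only against its paired negative, this decoupled form shares one negative log-sum-exp term across all positives while letting individual positives be masked or down-weighted independently.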