Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

2022 
Video object detection (VOD) focuses on detecting objects in each frame of a video, a challenging task due to appearance deterioration in certain frames. Recent works usually distill crucial information from multiple support frames to improve the reference features, but they operate only at the frame level or proposal level and therefore cannot sufficiently integrate spatial-temporal features. To address this challenge, we treat VOD as a process of hierarchical spatial-temporal feature interaction and introduce a Multi-level Spatial-Temporal (MST) feature aggregation framework that fully exploits frame-level, proposal-level, and instance-level information in a unified framework. Specifically, MST first measures context similarity in pixel space to enhance all frame-level features rather than updating only the reference features. The proposal-level feature aggregation then models object relations to augment the reference object proposals. Furthermore, to filter out irrelevant information from other classes and backgrounds, we introduce an instance ID constraint that boosts instance-level features by leveraging support proposal features belonging to the same object. In addition, we propose a Deformable Feature Alignment (DAlign) module before MST to achieve more accurate pixel-level spatial alignment for better feature aggregation. Extensive experiments on the ImageNet VID and UAVDT datasets demonstrate the superiority of our method over state-of-the-art (SOTA) methods. With a ResNet-101 backbone, our method achieves 83.3% and 62.1% on the two datasets, outperforming the SOTA method MEGA by 0.4% and 2.7%, respectively.
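The frame-level stage described above can be sketched as similarity-weighted aggregation over pixel features across frames. The sketch below is an illustrative assumption, not the paper's exact formulation: the function name, the cosine-similarity measure, and the per-pixel softmax weighting are all placeholders for the context-similarity mechanism the abstract outlines, and it updates every frame's features rather than only the reference frame's.

```python
import numpy as np

def frame_level_aggregation(frames):
    """Hypothetical sketch of frame-level spatial-temporal aggregation.

    frames: array of shape (T, C, H, W). Each frame's pixel features are
    enhanced by attending over the corresponding pixels of all frames,
    weighted by softmax-normalized cosine similarity. Note that ALL
    frames are updated, not only the reference frame.
    """
    T, C, H, W = frames.shape
    flat = frames.reshape(T, C, H * W)                      # (T, C, HW)
    # L2-normalize channel vectors so dot products are cosine similarities
    norm = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    out = np.empty_like(flat)
    for t in range(T):
        # per-pixel similarity between frame t and every frame: (T, HW)
        sim = np.einsum('cp,tcp->tp', norm[t], norm)
        # softmax over the frame axis, stabilized by subtracting the max
        w = np.exp(sim - sim.max(axis=0, keepdims=True))
        w /= w.sum(axis=0, keepdims=True)
        # similarity-weighted sum of support-frame features at each pixel
        out[t] = np.einsum('tp,tcp->cp', w, flat)
    return out.reshape(T, C, H, W)
```

In this sketch, identical frames yield uniform weights, so aggregation leaves the features unchanged; frames with deteriorated appearance borrow more heavily from the frames they resemble.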