Align R-CNN: A Pairwise Head Network for Visual Relationship Detection

2021 
Scene graphs connect individual objects with visual relationships. They serve as a comprehensive scene representation for downstream multimodal tasks. However, by examining recent progress in Scene Graph Generation (SGG), we find that the performance of recent works is highly limited by pairwise relationship modeling based on naive feature concatenation. Such pairwise features lack sufficient object interaction due to misaligned object parts, resulting in non-discriminative pairwise features for visual relationship prediction. For example, a naively concatenated pairwise feature usually makes the model fail to discriminate between riding and feeding for the object pair person and horse. To this end, we design a learning-to-align meta-architecture for dynamic object feature concatenation, which we call Align R-CNN. Specifically, we introduce a novel attention-based multiple-region alignment module that can be jointly optimized with SGG. Experiments on the large-scale SGG benchmark Visual Genome show that the proposed Align R-CNN can replace naive feature concatenation and thus boost all existing SGG methods.
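To make the contrast in the abstract concrete, below is a minimal, illustrative sketch (not the authors' released code) of naive pairwise concatenation versus an attention-based alignment of subject/object region features before concatenation. All names (RegionAlignHead, naive_pairwise_feature, the choice of grid regions, dimensions) are assumptions made for illustration only.

```python
# Hypothetical sketch contrasting naive pairwise concatenation with an
# attention-based region alignment head; not the paper's actual implementation.
import torch
import torch.nn as nn


def naive_pairwise_feature(subj_feat, obj_feat):
    """Baseline: concatenate pooled subject and object features.

    subj_feat, obj_feat: (B, D) pooled ROI features for the two boxes.
    Returns a (B, 2D) pairwise feature with no cross-object interaction.
    """
    return torch.cat([subj_feat, obj_feat], dim=-1)


class RegionAlignHead(nn.Module):
    """Attention-based alignment over multiple regions of the two objects.

    Each object is represented by several region features (e.g. grid cells of
    its ROI). Subject regions attend to object regions, so interacting parts
    are aligned before concatenation instead of being pooled away.
    """

    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.out = nn.Linear(2 * d, d)

    def forward(self, subj_regions, obj_regions):
        # subj_regions: (B, Ns, D), obj_regions: (B, No, D)
        q = self.q(subj_regions)                         # (B, Ns, D)
        k = self.k(obj_regions)                          # (B, No, D)
        v = self.v(obj_regions)                          # (B, No, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        aligned = attn @ v                               # object parts aligned to subject parts
        pair = torch.cat([subj_regions, aligned], -1)    # region-wise concat after alignment
        return self.out(pair).mean(dim=1)                # (B, D) pairwise relation feature


if __name__ == "__main__":
    B, Ns, No, D = 2, 7, 7, 256
    subj, obj = torch.randn(B, Ns, D), torch.randn(B, No, D)
    print(naive_pairwise_feature(subj.mean(1), obj.mean(1)).shape)  # torch.Size([2, 512])
    print(RegionAlignHead(D)(subj, obj).shape)                      # torch.Size([2, 256])
```

Because such a head only changes how the pairwise feature is formed, it can in principle be dropped into an existing SGG pipeline in place of the concatenation step, which is the sense in which the abstract claims the module can "replace the naive feature concatenation."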