Learning Semantic-Aware Spatial-Temporal Attention for Interpretable Action Recognition

2022 
Human beings can concentrate on the most semantically relevant visual information when recognizing actions, and thus make reasonable and interpretable predictions. However, most existing approaches to visual tasks do not explicitly imitate this ability to improve the performance and reliability of models. In this paper, we propose an interpretable action recognition framework that improves both the performance and the visual interpretability of 3D CNNs. Specifically, we design a semantic-aware attention module that learns correlative spatial-temporal attention for different action categories. To further exploit the rich semantics of features extracted at different layers, we design a hierarchical semantic fusion module guided by the learned attention. The two modules enhance and complement each other, and the semantic-aware attention module is plug-and-play. We evaluate our method on several benchmarks with comprehensive ablation studies and visualization analysis. Experimental results demonstrate the effectiveness of our method, which achieves favorable accuracy against state-of-the-art approaches while enhancing semantic interpretability (code will be available at https://github.com/PHDJieFu).
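To make the idea of class-wise spatial-temporal attention concrete, below is a minimal PyTorch sketch of how such a semantic-aware attention module might be wired onto 3D CNN features. This is an illustrative assumption, not the authors' implementation: the class name `SemanticAwareAttention`, the 1x1x1 convolution that produces one attention map per action category, and the residual re-weighting are all hypothetical design choices consistent with the abstract's description of a plug-and-play module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAwareAttention(nn.Module):
    """Hypothetical sketch: learns one spatial-temporal attention map per
    action category from 3D CNN features of shape (B, C, T, H, W)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # A 1x1x1 convolution produces a raw attention logit per class
        # at every spatial-temporal position.
        self.att_conv = nn.Conv3d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor, labels: torch.Tensor = None):
        # feats: (B, C, T, H, W) -> class-wise attention logits (B, K, T, H, W)
        logits = self.att_conv(feats)
        B, K, T, H, W = logits.shape
        # Normalize each class's map over all spatial-temporal positions.
        att = F.softmax(logits.flatten(2), dim=-1).view(B, K, T, H, W)
        if labels is not None:
            # Training: select the attention map of the ground-truth class.
            att = att[torch.arange(B), labels]          # (B, T, H, W)
        else:
            # Inference: average over classes (a predicted class works too).
            att = att.mean(dim=1)                       # (B, T, H, W)
        # Re-weight the features; the residual (1 + att) term keeps the
        # module plug-and-play, since it defaults to an identity pathway.
        return feats * (1.0 + att.unsqueeze(1))

# Usage with a dummy clip-level feature map:
feats = torch.randn(2, 256, 8, 14, 14)        # (B, C, T, H, W)
labels = torch.tensor([3, 7])
att_mod = SemanticAwareAttention(in_channels=256, num_classes=10)
out = att_mod(feats, labels)                  # same shape as feats
```

Because the module preserves the feature shape, a sketch like this could in principle be inserted after any intermediate stage of a 3D CNN, and the per-class maps it learns are what would be visualized for interpretability.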