Graph Convolutional Module for Temporal Action Localization in Videos

2021 
Temporal action localization, which aims to recognize the location and category of action instances in videos, has been studied extensively. Existing methods divide each video into multiple action units (i.e., proposals in two-stage methods and segments in one-stage methods) and then perform recognition/regression on each unit individually, without explicitly exploiting the relations among units, even though these relations play an important role in action localization. In this paper, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, including both two-stage and one-stage paradigms. Specifically, we first construct a graph in which each action unit is represented as a node and the relations between units as edges. We use two types of relations: one captures the temporal connections between units, and the other characterizes their semantic relationships. We then apply graph convolutional networks (GCNs) on the graph to model these relations and learn more informative representations for action localization, as sketched below. Experimental results show that GCM consistently improves the performance of both two-stage action localization methods (e.g., CBR and R-C3D) and one-stage methods (e.g., D-SSAD), verifying the generality and effectiveness of GCM. Moreover, with the aid of GCM, our approach significantly outperforms the state of the art on THUMOS14 and ActivityNet.
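To make the graph construction and GCN step concrete, here is a minimal PyTorch sketch of the idea. It is not the paper's implementation: the class name `SimpleGCM`, the similarity threshold, and the exact edge definitions (interval overlap for temporal edges, cosine similarity for semantic edges) are illustrative assumptions based only on the abstract's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCM(nn.Module):
    """Toy graph convolutional module over action-unit features.

    Each of the N action units (proposals or segments) is a node.
    Edges combine a temporal graph (units whose intervals overlap)
    with a semantic graph (units whose features are similar), and a
    single GCN layer aggregates neighbor information. All details
    here are assumptions, not the paper's exact formulation.
    """

    def __init__(self, dim, sim_threshold=0.7):
        super().__init__()
        self.weight = nn.Linear(dim, dim)
        self.sim_threshold = sim_threshold  # assumed semantic-edge cutoff

    def forward(self, feats, intervals):
        # feats: (N, dim) per-unit features
        # intervals: (N, 2) start/end times of each unit
        starts, ends = intervals[:, 0], intervals[:, 1]

        # Temporal edges: units i, j whose intervals overlap
        # (s_i < e_j and s_j < e_i); the diagonal is True, which
        # also serves as a self-loop.
        overlap = (starts.unsqueeze(1) < ends.unsqueeze(0)) & \
                  (starts.unsqueeze(0) < ends.unsqueeze(1))

        # Semantic edges: units with high cosine feature similarity.
        normed = F.normalize(feats, dim=1)
        semantic = (normed @ normed.t()) > self.sim_threshold

        adj = (overlap | semantic).float()

        # Symmetric normalization: D^{-1/2} A D^{-1/2}.
        deg = adj.sum(dim=1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

        # One GCN layer: aggregate neighbors, transform, nonlinearity.
        return F.relu(self.weight(adj_norm @ feats))

# Usage: enhance 10 proposal features of dimension 256.
gcm = SimpleGCM(dim=256)
feats = torch.randn(10, 256)
intervals = torch.sort(torch.rand(10, 2), dim=1).values  # start < end
out = gcm(feats, intervals)  # (10, 256) relation-enhanced features
```

Because the module only maps per-unit features to relation-enhanced features of the same shape, it can be dropped between the feature extractor and the recognition/regression heads of an existing localizer, which is what makes it pluggable into both paradigms.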