Weakly-supervised temporal attention 3D network for human action recognition

2021 
Abstract

From a series of observations, we infer that a human action in a video is defined by a set of significant frames. In this paper, we propose a weakly-supervised temporal attention 3D network for human action recognition, called TA3DNet, that accelerates 3D convolutional neural networks (3D CNNs) by assigning a different temporal importance to each frame. First, we obtain a short sequence of frames that still spans a long time range by regularly or randomly skipping frames, which avoids temporal redundancy, and we apply 3D convolutional layers to extract features for action recognition. Then, a temporal attention module assigns a different weight to each frame. We train this module in a weakly-supervised manner: its weights are updated using only class labels, without event information or any extra labels. As a result, TA3DNet reduces the number of input frames and yields a lightweight network for action recognition. Experimental results on two challenging datasets, UCF101 and HMDB51, demonstrate that TA3DNet achieves high recognition performance and outperforms state-of-the-art action recognition methods.
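The abstract does not give architectural details, so the following NumPy code is only a minimal sketch of the two ideas it describes, under stated assumptions. First, it samples a sparse set of frames spanning the whole video by regular or random skipping. Second, it applies softmax temporal-attention pooling whose scoring parameters are trained only from the clip-level class label. The function names (`sample_frame_indices`, `forward`, `loss_and_grads`), the parameters `w_att` and `w_cls`, the one-frame-per-segment random sampling scheme, and the linear classifier are all hypothetical. The 3D CNN backbone is not modeled and random stand-in features take its place, so this is not the TA3DNet implementation.

```python
import numpy as np


def sample_frame_indices(num_frames, num_samples, mode="regular", rng=None):
    """Choose num_samples frame indices that span the whole video.

    "regular": keep every k-th frame, k = num_frames // num_samples.
    "random":  split the video into num_samples equal segments and draw
               one frame per segment (an assumed scheme, not the paper's).
    """
    if not 0 < num_samples <= num_frames:
        raise ValueError("need 0 < num_samples <= num_frames")
    if mode == "regular":
        stride = num_frames // num_samples
        return np.arange(num_samples) * stride
    if mode == "random":
        if rng is None:
            rng = np.random.default_rng()
        edges = np.linspace(0, num_frames, num_samples + 1).astype(int)
        return np.array([rng.integers(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])])
    raise ValueError(f"unknown mode: {mode!r}")


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def forward(features, w_att, w_cls):
    """Temporal attention pooling followed by a linear classifier.

    features: (T, D) one feature vector per sampled time step; in a real
              model these would come from the 3D CNN backbone.
    w_att:    (D,)   hypothetical scoring vector of the attention module.
    w_cls:    (D, C) hypothetical linear classifier weights.
    """
    weights = softmax(features @ w_att)  # (T,) importance of each time step
    pooled = weights @ features          # (D,) attention-weighted clip feature
    probs = softmax(pooled @ w_cls)      # (C,) class probabilities
    return weights, pooled, probs


def loss_and_grads(features, label, w_att, w_cls):
    """Cross-entropy on the clip-level class label, with manual gradients.

    No frame-level annotation appears anywhere: the attention parameters
    receive gradient only through the classification loss.
    """
    weights, pooled, probs = forward(features, w_att, w_cls)
    loss = -np.log(probs[label])
    d_logits = probs.copy()
    d_logits[label] -= 1.0                 # dL/dlogits = p - one_hot(label)
    g_cls = np.outer(pooled, d_logits)     # dL/dw_cls
    d_pooled = w_cls @ d_logits            # dL/dpooled
    d_weights = features @ d_pooled        # dL/d(attention weight) per step
    d_scores = weights * (d_weights - weights @ d_weights)  # back through softmax
    g_att = features.T @ d_scores          # dL/dw_att
    return loss, g_att, g_cls


# Usage: random stand-in features replace the 3D CNN backbone output.
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(64, 16))            # 64 frames, 16-dim features
idx = sample_frame_indices(64, 8, mode="random", rng=rng)
clip = video_feats[idx]                            # 8 sparsely sampled steps
w_att = rng.normal(scale=0.1, size=16)
w_cls = rng.normal(scale=0.1, size=(16, 5))        # 5 hypothetical classes
loss, g_att, g_cls = loss_and_grads(clip, label=2, w_att=w_att, w_cls=w_cls)
w_att -= 0.1 * g_att                               # one SGD step, labels only
w_cls -= 0.1 * g_cls
```

The point of the sketch is the supervision signal. The loss only ever compares class probabilities with the clip's class label, so any preference the attention develops for particular frames is learned indirectly, through whichever frames help classification. A real model would compute these gradients with an autodiff framework rather than by hand.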