Skeleton-based Action Recognition with Multi-scale Spatial-temporal Convolutional Neural Network

2021 
Skeleton data convey significant information for human action recognition because they are robust to cluttered backgrounds and illumination variation. Early convolutional neural network (CNN) based methods mainly structure the skeleton sequence into a pseudo-image and feed it into an image classification network such as ResNet, which cannot capture comprehensive spatial-temporal features. Recently, graph convolutional networks (GCNs) have obtained superior performance. However, the computational complexity of GCN-based methods is quite high; some works even reach 100 GFLOPs for a single action sample. This is at odds with the highly condensed nature of skeleton data. In this paper, a Multi-scale Spatial-temporal Convolution Neural Network (MSST-Net) is proposed for skeleton-based action recognition. MSST-Net abandons complex graph convolutions and exploits the implicit complementary advantages across different scales of spatial-temporal representations, which previous work has often ignored. On two action recognition datasets, MSST-Net achieves impressive recognition accuracy at a low computational cost.
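To make the pseudo-image idea mentioned for the early CNN-based methods concrete, the following is a minimal sketch, not the preprocessing of any particular paper. It assumes a skeleton sequence stored as an array of shape (frames, joints, coordinates); the function name `skeleton_to_pseudo_image` and the example sizes (64 frames, 25 joints, 3 coordinates) are hypothetical choices made only for illustration.

```python
import numpy as np


def skeleton_to_pseudo_image(seq):
    """Arrange a skeleton sequence as an image-like uint8 array.

    Assumed layout (hypothetical): seq has shape (T, V, C), where
    T = frames (image rows), V = joints (image columns),
    C = coordinates per joint (image channels).
    """
    seq = np.asarray(seq, dtype=np.float64)
    if seq.ndim != 3:
        raise ValueError("expected shape (frames, joints, coords)")

    # Per-channel min/max over all frames and joints.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)

    # Avoid division by zero when a coordinate channel is constant.
    span = np.where(hi > lo, hi - lo, 1.0)

    # Rescale each channel to [0, 255] so it resembles pixel intensities.
    scaled = (seq - lo) / span
    return np.round(scaled * 255.0).astype(np.uint8)


# Synthetic data only: random values stand in for real joint coordinates.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(64, 25, 3))
image = skeleton_to_pseudo_image(sequence)
```

The resulting array can be passed to an ordinary image classifier. In this layout the column order is just the joint index, so joints that are adjacent in the image need not be adjacent in the body.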