RGB-skeleton fusion network for spatial-temporal action detection

2021 
Because most current algorithms use stacked RGB information for spatial-temporal action detection, temporal sequence information is easily lost during convolution and down-sampling, which makes spatial-temporal actions difficult to model and limits the development of action detection. Given that current pose estimation algorithms have achieved good detection accuracy, we propose an end-to-end network that fuses RGB with skeleton information to address spatial-temporal action detection. We use RGB to describe the appearance of objects and skeletons to describe their motion. Specifically, in the first stage we generate initial classification and localization proposals from RGB information with an SSD network. Second, we generate frame-level skeleton information with a state-of-the-art pose estimation algorithm; the skeletons help the SSD network filter negative samples during training, and after completion and normalization we stack the skeletons and feed them into an LSTM network for classification. Finally, we fuse the outputs of the SSD and LSTM networks. We believe that introducing skeleton information effectively addresses the insufficient capacity of RGB information for spatial-temporal action modeling. Notably, our skeleton information comes from state-of-the-art pose estimation algorithms rather than from annotation. For the datasets, we select the single-person action videos in UCF101 and UCF50. The experimental results show that our method significantly improves the action modeling ability of the neural network and yields effective results in action detection.
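
The described pipeline has three stages: SSD proposals from RGB frames, an LSTM classifier over stacked and normalized per-frame skeletons, and a fusion of the two outputs. The PyTorch sketch below illustrates only the skeleton branch and one plausible late-fusion rule; the joint count (17), class count (24), hidden size, and the weighted-average fusion are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """Classifies a stacked sequence of per-frame skeletons.

    Input shape: (batch, frames, num_joints * 2), where each frame holds the
    (x, y) coordinates of every joint, already completed (missing joints
    filled in) and normalized, e.g. scaled to [0, 1]. Joint count and class
    count are assumed values, not taken from the paper.
    """
    def __init__(self, num_joints=17, num_classes=24, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2,
                            hidden_size=hidden,
                            batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, skeletons):
        # Run the stacked skeleton sequence through the LSTM and classify
        # from the final hidden state.
        _, (h_n, _) = self.lstm(skeletons)
        return self.fc(h_n[-1])  # (batch, num_classes) logits

def fuse_scores(ssd_logits, lstm_logits, alpha=0.5):
    """Late fusion of the RGB (SSD) and skeleton (LSTM) streams.

    The paper fuses the two networks' outputs; the weighted average of
    softmax scores used here is one plausible rule, not the authors' stated
    method.
    """
    p_rgb = torch.softmax(ssd_logits, dim=-1)
    p_skel = torch.softmax(lstm_logits, dim=-1)
    return alpha * p_rgb + (1 - alpha) * p_skel

# Usage with dummy tensors standing in for real detections and poses:
model = SkeletonLSTM()
skel = torch.rand(4, 16, 34)     # 4 clips, 16 frames, 17 joints x (x, y)
ssd_logits = torch.randn(4, 24)  # per-proposal class scores from the SSD
fused = fuse_scores(ssd_logits, model(skel))
print(fused.shape)               # torch.Size([4, 24])
```

Late fusion keeps the two branches independently trainable, which matches the abstract's description of generating SSD proposals first and classifying stacked skeletons separately before combining the results.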