ASTM: An Attention-Based Spatiotemporal Model for Video Prediction Using 3D Convolutional Neural Networks

2021 
Video prediction has always been a challenging task in video representation learning due to the diversity of spatiotemporal evolution in videos. In this paper, we propose ASTM, an Attention-based SpatioTemporal Model for video prediction built on 3D Convolutional Neural Networks and Long Short-Term Memory (LSTM). Our method leverages both multi-term and short-term inter-frame dependencies in the temporal domain to capture reliable motion information in videos. In particular, we design an Efficient Inter-Frame Attention Gate (EIFAG) to efficiently aggregate the multi-term inter-frame dependencies, and we integrate 3D convolutional operations into the proposed model to further improve its local perception of videos by capturing more accurate short-term temporal dependencies. In addition, we adopt the multilayer Spatiotemporal LSTM structure of PredRNN to preserve more spatial appearance details. To evaluate the adaptability of our model to more complex real scenes, we collect a multi-level spatiotemporal (MLST) dataset. Experimental results show that the proposed model achieves state-of-the-art performance on both widely used datasets and the proposed MLST dataset.
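To make the architectural ideas above concrete, the following is a minimal, hypothetical PyTorch sketch of two of the described components: a 3D convolutional encoder that captures short-term temporal dependencies within a clip, and an inter-frame attention gate that aggregates hidden states from multiple past frames. All module names (`Conv3dEncoder`, `InterFrameAttentionGate`), tensor shapes, and the gating formulation here are illustrative assumptions, not the paper's actual EIFAG or ASTM definitions.

```python
# Illustrative sketch only; the paper's real EIFAG/ASTM modules may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Conv3dEncoder(nn.Module):
    """Encodes a short clip (B, C, T, H, W) to capture short-term motion."""

    def __init__(self, in_channels: int, hidden_channels: int):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, hidden_channels,
                              kernel_size=(3, 3, 3), padding=1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # Collapse the temporal axis so the output is a per-clip feature map.
        return F.relu(self.conv(clip)).mean(dim=2)  # (B, C', H, W)


class InterFrameAttentionGate(nn.Module):
    """Aggregates past-frame hidden states with dot-product attention,
    then gates the aggregated context against the current hidden state."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, h_t: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # h_t: (B, C, H, W); memory: (B, N, C, H, W) holds N past states.
        b, n, c, hgt, wdt = memory.shape
        q = h_t.flatten(1)                        # (B, C*H*W)
        k = memory.flatten(2)                     # (B, N, C*H*W)
        scores = torch.einsum('bd,bnd->bn', q, k) / (c * hgt * wdt) ** 0.5
        weights = scores.softmax(dim=1)           # attention over past frames
        context = (weights[:, :, None, None, None] * memory).sum(dim=1)
        g = torch.sigmoid(self.gate(torch.cat([h_t, context], dim=1)))
        return g * context + (1.0 - g) * h_t      # gated multi-term fusion


if __name__ == "__main__":
    enc = Conv3dEncoder(in_channels=1, hidden_channels=16)
    gate = InterFrameAttentionGate(channels=16)
    clip = torch.randn(2, 1, 4, 32, 32)           # B=2, T=4 grayscale frames
    h_t = enc(clip)                               # (2, 16, 32, 32)
    memory = torch.stack([h_t] * 5, dim=1)        # stand-in for 5 past states
    out = gate(h_t, memory)
    print(out.shape)                              # torch.Size([2, 16, 32, 32])
```

The learned gate blends the attention-weighted multi-frame context with the current hidden state, which mirrors the abstract's stated goal of combining multi-term inter-frame dependencies (via attention) with short-term ones (via 3D convolution).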