STAE: A Spatiotemporal Auto-Encoder for High-Resolution Video Prediction

2021 
Predicting high-resolution videos (≥ 256) remains a difficult task in the video prediction domain. To predict high-quality frames for high-resolution videos, both the challenging spatiotemporal representations and the computational cost must be carefully considered. In this paper, we propose a SpatioTemporal Auto-Encoder for high-resolution video prediction, named STAE. In our method, we first jointly use spatial and temporal encoders to extract low-dimensional spatial and temporal features from the high-resolution video input; this preserves the spatiotemporal information of the input while significantly reducing the computational load of the subsequent modules. We also design a SpatioTemporal Attention based Memory (STAM) that predicts the spatiotemporal features of future frames from the encoded low-dimensional features. The predicted spatial and temporal features are then decoded back to the high-dimensional data space by the spatial and temporal decoders. Finally, the predicted high-dimensional spatial and temporal representations are jointly used to predict the future frames. All modules in STAE are built on 3D neural networks to improve local perception of videos. Experimental results show that the proposed method outperforms diverse state-of-the-art methods on widely used datasets while keeping the computational load relatively low.
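The overall data flow described above (encode to a low-dimensional space, predict future features there, then decode back to frame resolution) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the real STAE uses learned 3D convolutional encoders/decoders and the STAM attention memory, whereas here average pooling, nearest-neighbor upsampling, and a mean over past features serve only as hypothetical stand-ins to show why prediction in the low-dimensional space is cheap.

```python
import numpy as np

def spatial_encode(frames, factor=8):
    """Stand-in for the 3D-CNN encoder: downsample each frame by average pooling."""
    t, h, w = frames.shape
    return frames.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def spatial_decode(feats, factor=8):
    """Stand-in for the decoder: upsample features back to frame resolution."""
    return feats.repeat(factor, axis=1).repeat(factor, axis=2)

def predict_features(feats):
    """Toy stand-in for STAM: predict the next feature map from past features."""
    return feats.mean(axis=0, keepdims=True)

frames = np.random.rand(4, 256, 256)    # 4 input frames at 256x256
feats = spatial_encode(frames)          # (4, 32, 32): low-dimensional features
next_feat = predict_features(feats)     # (1, 32, 32): predicted future features
next_frame = spatial_decode(next_feat)  # (1, 256, 256): decoded future frame
print(feats.shape, next_frame.shape)
```

Note that the predictive module operates on 32×32 features rather than 256×256 frames, a 64× reduction in spatial size, which mirrors the paper's motivation for encoding before prediction.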