VideoWhisper: Toward Discriminative Unsupervised Video Feature Learning With Attention-Based Recurrent Neural Networks

2017 
We present V ideo W hisper , a novel approach for unsupervised video representation learning. Based on the observation that the frame sequence encodes the temporal dynamics of a video (e.g., object movement and event evolution), we treat the frame sequential order as a self-supervision to learn video representations. Unlike other unsupervised video feature learning methods based on frame-level feature reconstruction that is sensitive to visual variance, V ideo W hisper is driven by a novel video “sequence-to-whisper” learning strategy. Specifically, for each video sequence, we use a prelearned visual dictionary to generate a sequence of high-level semantics, dubbed “whisper,” which can be considered as the language describing the video dynamics. In this way, we model V ideo W hisper as an end-to-end sequence-to-sequence learning model using attention-based recurrent neural networks. This model is trained to predict the whisper sequence and hence it is able to learn the temporal structure of videos. We propose two ways to generate video representation from the model. Through extensive experiments on two real-world video datasets, we demonstrate that video representation learned by V ideo W hisper is effective to boost fundamental multimedia applications such as video retrieval and event classification.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    40
    References
    15
    Citations
    NaN
    KQI
    []