Spatio-Temporal VLAD Encoding of Visual Events Using Temporal Ordering of the Mid-Level Deep Semantics

2020 
Classification of video events based on frame-level descriptors is a common approach to video recognition, and proper encoding of those descriptors is vital to the whole event classification procedure. Although several efficient video descriptor encoding methods exist, they often ignore the temporal ordering of the descriptors. In this paper, we show that taking into account the temporal inter-frame dependencies and tracking the chronological order of video sub-events further improves the accuracy of event recognition. First, the frame-level descriptors are extracted using convolutional neural networks (CNNs) pre-trained on ImageNet, which are fine-tuned on a portion of the training video frames. Then, a spatio-temporal encoding is applied to the derived descriptors. The proposed spatio-temporal encoding, the main contribution of this work, is inspired by the well-known vector of locally aggregated descriptors (VLAD) encoding in the spatial domain and by total variation de-noising (TVD) in the temporal domain. The unified spatio-temporal encoding is then shown to take the form of a convex optimization problem, which is solved efficiently with the alternating direction method of multipliers (ADMM) algorithm. The experimental results show that the proposed encoding method achieves higher recognition accuracy than both frame-level video encoding approaches and spatio-temporal video representations. Compared to the state-of-the-art approaches, our encoding method improves the mean average precision (mAP) on the Columbia Consumer Video (CCV), Unstructured Social Activity Attribute (USAA), YouTube-8M, and Kinetics datasets and is very competitive on the FCVID dataset.
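To make concrete the point that common encodings discard temporal order, the following is a minimal sketch of standard, order-agnostic VLAD applied to frame descriptors. It is not the proposed spatio-temporal encoding, whose formulation is not given in this abstract. The function name `vlad_encode`, the random stand-in descriptors, and the random codebook centers are all assumptions made for illustration; in practice the centers would typically come from k-means on training descriptors.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Standard (order-agnostic) VLAD of frame descriptors.

    descriptors: (T, D) array, one row per frame.
    centers:     (K, D) array of codebook centers (hypothetical here).
    Returns an L2-normalized vector of length K * D.
    """
    # Squared Euclidean distance from every frame to every center: shape (T, K).
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)  # hard assignment of each frame to one center

    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[nearest == k]
        if len(members) > 0:
            # Accumulate residuals between the assigned frames and their center.
            vlad[k] = (members - centers[k]).sum(axis=0)

    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Illustrative data only: 30 "frames" of 8-D descriptors and 4 centers.
rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 8))
centers = rng.normal(size=(4, 8))

code = vlad_encode(frames, centers)
# Encoding the same frames in a shuffled order gives the same vector,
# because the residual sums do not depend on frame order.
shuffled = vlad_encode(frames[rng.permutation(30)], centers)
```

Because `code` and `shuffled` agree up to floating-point rounding, two videos containing the same frames in different chronological order are indistinguishable under this encoding, which is the limitation the temporal term in the abstract is meant to address.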