Weakly-Supervised Multi-Person Action Recognition in 360° Videos

2020 
The recent development of commodity 360° cameras has enabled a single video to capture an entire scene, which offers promising potential in surveillance scenarios. However, research in omnidirectional video analysis has lagged behind the hardware advances. In this work, we address the important problem of action recognition in top-view 360° videos. Due to the wide field-of-view, 360° videos usually capture multiple people performing actions at the same time. Furthermore, the appearance of people is deformed. The proposed framework first transforms top-view omnidirectional videos into panoramic videos using a calibration-free method. Spatio-temporal features are then extracted using region-based 3D CNNs for action recognition. We propose a weakly-supervised method based on multi-instance multi-label learning, which trains the model to recognize and localize multiple actions in a video using only video-level action labels as supervision. We perform experiments to quantitatively validate the efficacy of the proposed method over state-of-the-art baselines and variants of our model, and qualitatively demonstrate action localization results. To enable research in this direction, we introduce the 360Action dataset, the first omnidirectional video dataset for multi-person action recognition with a diverse set of scenes, actors and actions. The dataset is available at https://github.com/ryukenzen/360action.
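The multi-instance multi-label idea above can be sketched as follows. This is not the paper's implementation; it is a minimal illustration, assuming per-region action logits are the "instances," video-level scores are obtained by max-pooling over regions, and a per-class binary cross-entropy is applied so that only video-level labels are needed as supervision. All function and variable names here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def miml_video_loss(region_logits, video_labels):
    """Weakly-supervised multi-instance multi-label loss (illustrative sketch).

    region_logits: (R, C) array of per-region action logits (R regions,
                   C action classes); each region is one "instance".
    video_labels:  (C,) binary vector of video-level action labels.
    """
    # Max-pool over instances: a class is present in the video if at
    # least one region strongly supports it.
    video_logits = region_logits.max(axis=0)          # shape (C,)
    p = sigmoid(video_logits)
    eps = 1e-7                                        # numerical stability
    bce = -(video_labels * np.log(p + eps)
            + (1.0 - video_labels) * np.log(1.0 - p + eps))
    return bce.mean()

# Localization at inference could then threshold each region's sigmoid
# score for the predicted classes to decide which region performs which action.
```

Because the max over regions is what receives the gradient, training pushes up the score of the single most responsible region per class, which is what lets video-level labels alone drive both recognition and localization.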
References: 34 · Citations: 7