Temporal Relations Matter: A Two-Pathway Network for Aerial Video Recognition
With the growing volume of aerial videos, the demand for automatically parsing them is surging. To this end, current research mainly focuses on extracting a holistic feature with convolutions along both the spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies, which are important for describing complicated dynamics. In this paper, we propose a novel two-pathway network that models not only holistic features but also temporal relations for aerial video classification. More specifically, our model employs a two-pathway architecture: (1) a holistic representation pathway that learns a general feature of frame appearances and short-term temporal variations, and (2) a temporal relation pathway that captures multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Our model is evaluated on an event recognition dataset, ERA, and achieves state-of-the-art results, which demonstrates its effectiveness and good generalization capacity.
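To make the two-pathway idea concrete, the following is a minimal, illustrative sketch and not the network proposed in this paper. It assumes per-frame feature vectors are already available, substitutes mean pooling for the convolutional holistic pathway, and uses a hand-written toy relation (mean absolute change between consecutive frames of a sampled subset) in place of a learned relation module. The names `mean_pool`, `relation_at_scale`, and `two_pathway_features`, as well as fusion by concatenation, are assumptions made for illustration only.

```python
from itertools import combinations
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Vector]) -> Vector:
    """Stand-in for the holistic pathway: average frame features over time."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def relation_at_scale(frames: Sequence[Vector], k: int) -> Vector:
    """Toy k-frame relation, averaged over all ordered k-frame subsets.

    The subsets need not be adjacent, so distant frames are related directly.
    The relation is the mean absolute change between consecutive frames of
    the subset (an assumed placeholder for a learned relation function).
    """
    if k < 2 or k > len(frames):
        raise ValueError("scale k must satisfy 2 <= k <= number of frames")
    dim = len(frames[0])
    total = [0.0] * dim
    count = 0
    # combinations() yields index tuples in increasing (temporal) order.
    for idx in combinations(range(len(frames)), k):
        for d in range(dim):
            change = sum(
                abs(frames[idx[i + 1]][d] - frames[idx[i]][d])
                for i in range(k - 1)
            )
            total[d] += change / (k - 1)
        count += 1
    return [t / count for t in total]


def two_pathway_features(
    frames: Sequence[Vector], scales: Sequence[int] = (2, 3)
) -> Vector:
    """Concatenate the holistic feature with one relation feature per scale."""
    fused = list(mean_pool(frames))
    for k in scales:
        fused.extend(relation_at_scale(frames, k))
    return fused
```

For a clip with T frames of D-dimensional features and S scales, `two_pathway_features` returns a vector of length D * (1 + S). In a trained model, both the pooling and the toy relation would be replaced by learned modules.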