A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
2021
We present a large-scale study on unsupervised spatiotemporal representation
learning from videos. With a unified perspective on four recent image-based
frameworks, we study a simple objective that can easily generalize all these
methods to space-time. Our objective encourages temporally-persistent features
in the same video, and in spite of its simplicity, it works surprisingly well
across: (i) different unsupervised frameworks, (ii) pre-training datasets,
(iii) downstream datasets, and (iv) backbone architectures. We draw a series of
intriguing observations from this study, e.g., we discover that encouraging
long-spanned persistency can be effective even if the timespan is 60 seconds.
In addition to state-of-the-art results in multiple benchmarks, we report a few
promising cases in which unsupervised pre-training can outperform its
supervised counterpart. Code is made available at
https://github.com/facebookresearch/SlowFast
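The objective described above treats clips drawn from the same video as positives, pulling their embeddings together while pushing apart clips from other videos. As a minimal sketch of such a temporally-persistent contrastive objective, the following implements a symmetric-style InfoNCE loss over a batch of clip-embedding pairs; the function name, temperature value, and NumPy formulation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def clip_info_nce_loss(z1, z2, temperature=0.1):
    """Hypothetical InfoNCE sketch: z1[i] and z2[i] are embeddings of two
    clips sampled from the same video i; clips from other videos in the
    batch serve as negatives. z1, z2: (N, D) arrays."""
    # L2-normalize embeddings so similarity is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature.
    logits = z1 @ z2.T / temperature              # (N, N)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # Log-softmax over each row; positives lie on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss encourages the two clips of each video to have higher similarity than any cross-video pair, i.e., features that persist over the sampled timespan.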