Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
2020
Humans understand videos through both their visual and audio content. In this work, we present a self-supervised cross-modal representation approach for learning audio-visual correspondence (AVC) from videos in the wild. After the learning stage, we explore both cross-modal and intra-modal retrieval with the learned representations. We evaluate our approach on the VGGSound dataset [1], where it achieves promising results.
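The abstract does not specify the training objective. A common formulation for self-supervised audio-visual correspondence is a symmetric contrastive (InfoNCE-style) loss, where the audio and visual embeddings of the same clip are treated as a positive pair and all other clips in the batch as negatives. The following is a minimal NumPy sketch under that assumption; the function name, temperature value, and toy data are illustrative, not the paper's actual method.

```python
import numpy as np

def avc_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for audio-visual correspondence:
    embeddings in the same row (same clip) are positives; every other
    pairing in the batch is a negative."""
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (B, B) similarity matrix

    def log_softmax(x, axis):
        # Numerically stable log-softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Audio -> video direction (rows) and video -> audio direction (columns).
    loss_av = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_va = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return (loss_av + loss_va) / 2

# Toy batch: visual embeddings are noisy copies of the audio embeddings,
# mimicking correlated modalities of the same clips.
rng = np.random.default_rng(0)
B, D = 8, 128
audio = rng.standard_normal((B, D))
video = audio + 0.1 * rng.standard_normal((B, D))
print(avc_contrastive_loss(audio, video))
```

Once trained with such an objective, cross-modal retrieval reduces to ranking clips by cosine similarity between an audio query and visual embeddings (or vice versa), and intra-modal retrieval to the same ranking within a single modality's embedding space.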