Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos

2020 
Humans understand videos through both their visual and audio content. In this work, we present a self-supervised cross-modal representation learning approach based on audio-visual correspondence (AVC) for videos in the wild. After the learning stage, we explore retrieval with the learned representations in both cross-modal and intra-modal settings. We evaluate our approach on the VGGSound dataset [1], where it achieves promising results.
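The abstract does not give implementation details, but the AVC objective is commonly realized as a contrastive (InfoNCE-style) loss between paired audio and visual clips, with retrieval performed by cosine similarity in the shared embedding space. The following is a minimal sketch under those assumptions, not the authors' method: the encoder architectures, feature dimensions, and loss form are all placeholders for illustration.

```python
# Hedged sketch (not the paper's code): audio-visual correspondence learning
# with a symmetric contrastive loss over a batch of paired clips, followed by
# similarity-based retrieval. Encoders are simple projection heads standing in
# for real video/audio backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCModel(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, embed_dim=256):
        super().__init__()
        # Placeholder projection heads; a real system would use, e.g., a video
        # CNN and an audio spectrogram CNN (assumed, not specified in the abstract).
        self.visual_proj = nn.Sequential(
            nn.Linear(visual_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.audio_proj = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, visual_feats, audio_feats):
        # L2-normalized embeddings so dot products are cosine similarities.
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        return v, a

def avc_contrastive_loss(v, a, temperature=0.07):
    # Audio and visual streams from the same video are positives; all other
    # pairs in the batch serve as negatives.
    logits = v @ a.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2a = F.cross_entropy(logits, targets)       # visual -> audio
    loss_a2v = F.cross_entropy(logits.t(), targets)   # audio -> visual
    return 0.5 * (loss_v2a + loss_a2v)

# Toy training step on random tensors standing in for pre-extracted clip features.
model = AVCModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
visual_feats = torch.randn(32, 512)   # e.g. pooled frame features
audio_feats = torch.randn(32, 128)    # e.g. pooled log-mel features
v, a = model(visual_feats, audio_feats)
loss = avc_contrastive_loss(v, a)
loss.backward()
optimizer.step()

# Retrieval with the learned embeddings: cross-modal retrieval ranks visual
# clips by similarity to an audio query; intra-modal retrieval compares
# embeddings within a single modality the same way.
with torch.no_grad():
    sims = a @ v.t()             # audio-to-visual cosine similarities
    top1 = sims.argmax(dim=1)    # nearest visual clip for each audio query
```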