VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2021 
We present VideoCLIP, a contrastive approach to pre-training a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives obtained via nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
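To make the contrastive objective concrete, below is a minimal PyTorch sketch of a symmetric InfoNCE-style video-text loss of the kind the abstract describes. It is an illustrative assumption, not the paper's actual implementation: the function name, the `temperature` parameter, and the convention that other rows in the batch (e.g., retrieved nearest neighbors) serve as the negatives are all choices made here for clarity.

```python
# Hedged sketch: a symmetric contrastive (InfoNCE) loss over paired
# video/text embeddings. Names and defaults are illustrative assumptions,
# not VideoCLIP's released code.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) tensors whose matching rows are
    positives (temporally overlapping clip/caption pairs); the remaining
    rows in the batch, e.g. retrieved nearest neighbors, act as negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) cosine-similarity logits, scaled by the temperature.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions (video-to-text and text-to-video)
    # and average the two cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In this sketch, mining hard negatives would amount to constructing the batch so that each positive pair is accompanied by retrieved near-duplicate clips, which the in-batch softmax then treats as negatives.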