MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques

Sihan Chen, Xinxin Zhu, Dongze Hao, Wei Liu, Jiawei Liu, Zijia Zhao, Longteng Guo, Jing Liu

2021

The quality of video representation directly determines the performance of video-related tasks, for both understanding and generation. In this paper, we propose a single-modality pretrained feature fusion technique, which is composed of a multi-view feature extraction method and a dedicated multi-modality feature fusion strategy. We conduct comprehensive ablation studies on the MSR-VTT dataset to demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods on both the MSR-VTT and VATEX datasets. We further propose a multi-modality pretrained model finetuning technique and a dataset augmentation scheme to improve the model's generalization capability. Based on these two pretraining techniques and the dataset augmentation scheme, we won first place in the video captioning track of the MM21 Pre-training for Video Understanding Challenge.
