Multi-stage Multi-modal Pre-training for Video Representation.

2021 
Multi-modal networks are usually challenging to train because of their complexity. On the one hand, they are prone to underfitting due to the heterogeneous data formats of the different modalities. On the other hand, data from different domains follow different distributions, and these domain differences can be difficult to eliminate in joint training. This paper presents a Multi-Stage Multi-Modal pre-training strategy (MSMM) for learning multi-modal joint representations effectively. To avoid the difficulty of end-to-end multi-modal training, MSMM first trains each uni-modal network separately and then trains the multi-modal network jointly. After multi-stage pre-training, we obtain both a better multi-modal joint representation and better uni-modal representations. Meanwhile, we design a multi-modal network and a multi-task loss to train the whole network in an end-to-end fashion. Extensive empirical results show that MSMM significantly improves the multi-modal model's performance on the video classification task.
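The two-stage schedule described above can be sketched as follows. This is a minimal illustration of the training order only; the encoder structures, modality names, task names, and loss weights are hypothetical, since the abstract does not specify them.

```python
# Hypothetical sketch of the MSMM two-stage schedule: first separate
# uni-modal pre-training, then joint multi-modal training with a
# weighted multi-task loss. All names and numbers are illustrative.

def pretrain_unimodal(encoder, data):
    # Stage 1: each uni-modal encoder is trained on its own modality,
    # sidestepping the optimization difficulty of joint end-to-end
    # training from scratch. Here "training" is a placeholder counter.
    encoder["steps"] += len(data)
    return encoder

def joint_train(encoders, task_weights):
    # Stage 2: the pre-trained encoders are combined and fine-tuned
    # jointly, end-to-end, under a weighted sum of per-task losses
    # (the multi-task loss). Per-task losses are placeholders here.
    total_loss = sum(weight * 1.0 for weight in task_weights.values())
    return {"encoders": encoders, "loss": total_loss}

# Hypothetical modalities for video representation learning.
video_enc = {"modality": "video", "steps": 0}
audio_enc = {"modality": "audio", "steps": 0}
text_enc = {"modality": "text", "steps": 0}

# Stage 1: separate uni-modal pre-training.
for enc in (video_enc, audio_enc, text_enc):
    pretrain_unimodal(enc, data=range(100))

# Stage 2: multi-modal joint training with a multi-task loss.
model = joint_train(
    [video_enc, audio_enc, text_enc],
    task_weights={"classification": 1.0, "alignment": 0.5},
)
```

The key design point is the decoupling: stage 1 gives each modality a well-initialized encoder before stage 2 has to reconcile their differing distributions in a shared representation.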