Multi-stage Multi-modal Pre-training for Video Representation.

2021 
Multi-modal networks are usually challenging to train because of their complexity. On the one hand, they are prone to underfitting due to the heterogeneous data formats of the different modalities. On the other hand, data from different domains follow different distributions, and these domain differences can be difficult to eliminate in joint training. This paper presents a Multi-Stage Multi-Modal pre-training strategy (MSMM) for learning multi-modal joint representations effectively. To avoid the difficulty of end-to-end multi-modal training, MSMM first trains each uni-modal network separately and then trains the multi-modal network jointly. After multi-stage pre-training, we obtain both a better multi-modal joint representation and better uni-modal representations. Meanwhile, we design a multi-modal network and a multi-task loss to train the whole network in an end-to-end fashion. Extensive empirical results show that MSMM significantly improves the multi-modal model's performance on the video classification task.
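The two-stage schedule described above can be sketched as follows. This is a minimal illustration of the training order only; the encoder structures, modality names, task names, and loss weights are hypothetical, since the abstract does not specify them.

```python
# Hypothetical sketch of the MSMM two-stage schedule: first separate
# uni-modal pre-training, then joint multi-modal training with a
# weighted multi-task loss. All names and numbers are illustrative.

def pretrain_unimodal(encoder, data):
    # Stage 1: each uni-modal encoder is trained on its own modality,
    # sidestepping the optimization difficulty of joint end-to-end
    # training from scratch. Here "training" is a placeholder counter.
    encoder["steps"] += len(data)
    return encoder

def joint_train(encoders, task_weights):
    # Stage 2: the pre-trained encoders are combined and fine-tuned
    # jointly, end-to-end, under a weighted sum of per-task losses
    # (the multi-task loss). Per-task losses are placeholders here.
    total_loss = sum(weight * 1.0 for weight in task_weights.values())
    return {"encoders": encoders, "loss": total_loss}

# Hypothetical modalities for video representation learning.
video_enc = {"modality": "video", "steps": 0}
audio_enc = {"modality": "audio", "steps": 0}
text_enc = {"modality": "text", "steps": 0}

# Stage 1: separate uni-modal pre-training.
for enc in (video_enc, audio_enc, text_enc):
    pretrain_unimodal(enc, data=range(100))

# Stage 2: multi-modal joint training with a multi-task loss.
model = joint_train(
    [video_enc, audio_enc, text_enc],
    task_weights={"classification": 1.0, "alignment": 0.5},
)
```

The key design point is the decoupling: stage 1 gives each modality a well-initialized encoder before stage 2 has to reconcile their differing distributions in a shared representation.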