Multi-Scale Progressive Attention Network for Video Question Answering

Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, Li Lingling

2021

Understanding the multi-scale visual information in a video is essential for Video Question Answering (VideoQA). We therefore propose a novel Multi-Scale Progressive Attention Network (MSPAN) that performs relational reasoning over cross-scale video information. We construct clips of different lengths to represent different scales of the video. The clip-level features are then aggregated into node features by max-pooling, and a graph is generated for each scale of clips. For cross-scale feature interaction, we design a message passing strategy between graphs of adjacent scales, consisting of top-down and bottom-up scale interaction. Guided by the question through progressive attention, we fuse the video features of all scales. Experiments on three benchmarks (TGIF-QA, MSVD-QA, and MSRVTT-QA) show that our method achieves state-of-the-art performance.
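
The clip construction and max-pool aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the clip lengths or how clips are placed, so the non-overlapping partition, the `clip_lengths` parameter, and the function names `max_pool` and `multi_scale_nodes` are all assumptions made for illustration. Graph construction, cross-scale message passing, and question-guided attention are omitted.

```python
from typing import List

Vector = List[float]


def max_pool(frames: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(values) for values in zip(*frames)]


def multi_scale_nodes(frames: List[Vector], clip_lengths: List[int]) -> List[List[Vector]]:
    """Build one list of node features per scale.

    Assumption (not from the paper): each scale splits the frame sequence
    into consecutive, non-overlapping clips of a fixed length; a final
    shorter clip is kept as-is.
    """
    scales = []
    for length in clip_lengths:
        if length <= 0:
            raise ValueError("clip length must be positive")
        # Each clip is pooled into a single node feature.
        nodes = [
            max_pool(frames[start:start + length])
            for start in range(0, len(frames), length)
        ]
        scales.append(nodes)
    return scales


# Toy input: four frames with 2-dimensional features (hypothetical values).
frames = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0], [0.0, 4.0]]
scales = multi_scale_nodes(frames, clip_lengths=[1, 2, 4])
for length, nodes in zip([1, 2, 4], scales):
    print(length, nodes)
```

In this toy setup, shorter clips yield more nodes with finer temporal detail, while the longest clip yields a single node summarizing the whole sequence. Each per-scale node list would then serve as the vertex set of that scale's graph.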
