Vision-and-Dialog Navigation by Fusing Cross-Modal Features

2021 
Recent research in the robotics community shows a trend toward intelligent robots that can understand their environment and navigate to a goal location by communicating with their human users. Such tasks require agents to process multi-modal information effectively. Although multi-modal information processing has long been studied, how to effectively fuse different modalities of information remains challenging. In this paper, we focus on the vision-and-dialog navigation (NDH) task, which aims to build dialog-enabled agents that find a path to the goal location in unexplored environments by inferring navigation actions from the dialog history and visual inputs. We first investigate what role visual features play in the NDH task and observe the same trend reported for the Vision-and-Language Navigation (VLN) task: the level of visual features used strongly affects model performance. In particular, agents relying on low-level visual features hardly generalize to unseen environments (i.e., environments not used in training), whereas models using only high-level visual features perform better in unseen environments but suffer a significant performance drop in seen environments, indicating that they cannot understand and remember the seen environments thoroughly. Motivated by this observation, we explore several ways to fuse these features. We propose a model that fuses the dialog feature with each modality of visual feature, and whose prediction is an ensemble of jointly trained models, each focusing on a different modality. Our method can be applied to any VLN or NDH model. Our results show that it improves the performance of NDH models in both seen and unseen environments.
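
The abstract does not detail the fusion architecture, so the sketch below only illustrates the general idea it describes: the dialog feature is fused with each visual modality separately, and the action prediction is an ensemble of the jointly trained branches. All module names, dimensions, and the gating-style fusion are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of dialog/visual cross-modal fusion for NDH (not the authors' code).
# Assumptions: the dialog history is already encoded into a single vector, and each
# viewpoint provides both low-level (e.g. CNN grid) and high-level (e.g. semantic) features.
import torch
import torch.nn as nn


class CrossModalFusionAgent(nn.Module):
    def __init__(self, dialog_dim=512, low_dim=2048, high_dim=300, hidden=512, num_actions=6):
        super().__init__()
        # One fusion branch per visual modality: project the visual feature,
        # then gate it by the projected dialog feature.
        self.low_proj = nn.Linear(low_dim, hidden)
        self.high_proj = nn.Linear(high_dim, hidden)
        self.dialog_proj = nn.Linear(dialog_dim, hidden)
        self.low_head = nn.Linear(hidden, num_actions)
        self.high_head = nn.Linear(hidden, num_actions)

    def fuse(self, dialog, visual, proj, head):
        # Element-wise gating of the visual feature by the dialog feature --
        # a simple stand-in for whatever fusion mechanism the paper actually uses.
        gate = torch.sigmoid(self.dialog_proj(dialog))
        fused = torch.tanh(proj(visual)) * gate
        return head(fused)

    def forward(self, dialog_feat, low_feat, high_feat):
        low_logits = self.fuse(dialog_feat, low_feat, self.low_proj, self.low_head)
        high_logits = self.fuse(dialog_feat, high_feat, self.high_proj, self.high_head)
        # Ensemble: average the predictions of the jointly trained branches.
        return (low_logits + high_logits) / 2


# Usage with random tensors (batch of 4 navigation steps).
agent = CrossModalFusionAgent()
dialog = torch.randn(4, 512)
low_visual = torch.randn(4, 2048)   # e.g. mean-pooled CNN features
high_visual = torch.randn(4, 300)   # e.g. detected-object / semantic embeddings
action_logits = agent(dialog, low_visual, high_visual)
print(action_logits.shape)  # torch.Size([4, 6])
```

Because the two branches share the dialog encoder and are trained jointly, averaging their logits lets the agent benefit from the memorization ability of low-level features in seen environments and the generalization of high-level features in unseen ones.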