Cognitive Attention Network (CAN) for Text and Image Multimodal Visual Dialog Systems

2020 
Visual question answering and visual dialog systems are emerging research areas in natural language processing that exploit image and text modalities to convey an understanding of the context and attributes of a conversation, as humans do on online chat platforms. These multimodal dialog techniques enable the extended use of chatbots in many open and vertical domains. In this paper, we propose the cognitive attention network (CAN), a visual dialog system capable of answering multiple user questions about an image, and also able to identify similar images from past conversations and refer to them during an ongoing question-answering (Q&A) chat. Our model comprises Faster R-CNN, pre-trained BERT, late data fusion, and a memory network that serves as a knowledge base for the temporary storage of previous visio-textual dialog representations. Training on the VisDial v1.0 benchmark dataset, our model achieves competitive results that outperform some existing state-of-the-art models.
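The pipeline described above (per-modality encoders, late fusion of the resulting vectors, and a memory store of fused representations from past dialogs) can be illustrated with a minimal sketch. This is not the authors' implementation: the `late_fuse` concatenation, the `DialogMemory` class, and cosine-similarity retrieval are illustrative assumptions standing in for the actual Faster R-CNN / BERT encoders and the paper's memory network.

```python
import numpy as np

def late_fuse(img_feat, txt_feat):
    """Late data fusion: concatenate the (assumed, pre-computed) image
    and text feature vectors into one visio-textual representation."""
    return np.concatenate([img_feat, txt_feat])

class DialogMemory:
    """Toy stand-in for the paper's memory network: temporarily stores
    fused representations of past dialogs and retrieves the most
    similar one for the current query by cosine similarity."""
    def __init__(self):
        self.slots = []  # list of (dialog_id, unit-normalized vector)

    def write(self, dialog_id, vec):
        self.slots.append((dialog_id, vec / (np.linalg.norm(vec) + 1e-8)))

    def read(self, query):
        # Return (dialog_id, score) of the best-matching stored dialog.
        q = query / (np.linalg.norm(query) + 1e-8)
        scored = [(did, float(q @ v)) for did, v in self.slots]
        return max(scored, key=lambda s: s[1]) if scored else None

# Usage: store two fused dialog representations, then query with a
# vector close to the first one (toy 2-D features for illustration).
mem = DialogMemory()
mem.write("dog_photo", late_fuse(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
mem.write("cat_photo", late_fuse(np.array([0.0, 1.0]), np.array([1.0, 0.0])))
best = mem.read(late_fuse(np.array([0.9, 0.1]), np.array([0.1, 0.9])))
print(best[0])  # → dog_photo
```

In the full model, the image vector would come from Faster R-CNN region features and the text vector from pre-trained BERT; the retrieved memory entry is what lets the system refer back to similar images from earlier conversations.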