Overcoming Language Priors In Visual Question Answering With Adversarial Regularization

Authors:
Sainandan Ramakrishnan Georgia Institute of Technology
Aishwarya Agrawal Georgia Institute of Technology
Stefan Lee Georgia Institute of Technology

Abstract:

Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- e.g., overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings.
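The adversarial regularization named in the title is typically realized with a question-only adversary whose gradients are sign-flipped (a gradient reversal layer) before reaching the question encoder, so the encoder is pushed away from answer-predictive language priors. The sketch below is a minimal PyTorch illustration of that mechanism; the dimensions, variable names, and the toy `adversary` classifier are illustrative choices, not the paper's actual architecture.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient's sign so the upstream encoder is trained to
        # *hurt* the adversary, not help it.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Toy demo: a "question encoding" q and a question-only adversary that tries
# to predict the answer from the question alone (no image features).
torch.manual_seed(0)
q = torch.randn(4, 8, requires_grad=True)   # batch of 4 question encodings
adversary = torch.nn.Linear(8, 3)           # answer classifier over 3 classes
logits = adversary(grad_reverse(q, lambd=1.0))
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0, 1, 2, 0]))
loss.backward()
# q.grad now points in the direction that increases the adversary's loss,
# discouraging the question representation from encoding answer priors.
```

The adversary itself still receives normal (un-reversed) gradients through its own parameters, so it keeps improving at exploiting language priors while the encoder learns to withhold them.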
