Abstract:
Visual Question Answering (VQA) is a complex cognitive multimodal inference problem
in which the latest techniques from the fields of NLP (Natural Language Processing)
and CV (Computer Vision) are merged with the goal of answering open-ended, free-form
natural language questions by understanding the associated visual content. Previously,
the performance of VQA models suffered when they were asked knowledge-based and
commonsense-aware questions, but this changed with the introduction of transformers,
as transformer-based language models now possess some degree of knowledge and
commonsense implicitly. Additionally, we can also provide external knowledge explicitly
using Knowledge Representation and Reasoning (KRR) techniques to further improve
performance on benchmarks. To train and benchmark these knowledge-aware
VQA models, several datasets such as OK-VQA, GQA, and KVQA have been introduced, in which
questions require some form of cognitive inference over available external knowledge.
These datasets are carefully crafted for their intended purpose, but they are static,
offering only a limited set of questions and images.
This paper presents a framework capable of producing multiple relevant
knowledge-aware MCQs for each unique image, using the knowledge-rich
corpus of Wikipedia. These MCQs can be used to prepare dynamic knowledge-aware
VQA datasets. The framework can also be used to build a visual learning app that
educates children in an interactive manner, especially in remote areas of developing
countries where they seldom get a chance to learn new concepts in a proper
school environment.