Abstract:
Visual Question Answering (VQA) is the task of answering questions about an image. A VQA system combines computer vision and natural language processing to build a human-interactive system: it extracts visual features from the image, tokenizes and embeds the question, and finally fuses the two modalities to infer an answer.
Many previous techniques have used CNNs, RNNs, LSTMs, and GCNs to acquire visual content from the image. However, these approaches do not cater for information beyond the objects that are visually seen. Humans are capable of answering questions on the basis of common-sense knowledge. Inspired by this capability, knowledge bases that intelligent systems can process have been introduced. Acquiring external knowledge alongside a meaningful representation of the image has always been a challenging task in VQA. A wide variety of knowledge graphs have been created that contain facts, common-sense knowledge, identities of famous personalities, and so on; some of them are structured while others are unstructured. Using these knowledge graphs, this thesis presents a multimodal architecture that organizes the image and question data into a structured form
that not only detects the objects in the image but also captures the relationships and interactions between them. Knowledge graph integration further enhances the model's ability to exploit common-sense knowledge while answering the question. A question-guided attention mechanism then helps the model retain question-relevant information and discard any redundant information that may be retrieved. Lastly, a GRUC network uses gates to pass useful information from the scene and semantic graphs into the knowledge-driven graph and output a globally optimized answer. Our model outperforms state-of-the-art architectures on the FVQA dataset. A high-level overview of our model is shown with an example in Fig. 1 below.
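
To make the fusion and question-guided attention steps concrete, the following is a minimal sketch of how such a step could look in PyTorch; the module, its dimensions, the elementwise fusion, and all names are illustrative assumptions for exposition, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedFusion(nn.Module):
    """Toy sketch: attend over image region features using the question,
    then fuse the attended visual vector with the question embedding.
    All dimensions and layer choices are illustrative assumptions."""

    def __init__(self, v_dim=2048, q_dim=1024, hid=512, n_answers=500):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)   # project question embedding
        self.att = nn.Linear(hid, 1)          # scalar attention score per region
        self.classifier = nn.Linear(hid, n_answers)

    def forward(self, v, q):
        # v: (batch, n_regions, v_dim) region features, e.g. from a detector
        # q: (batch, q_dim) question embedding, e.g. the last LSTM state
        vh = self.v_proj(v)                     # (B, R, hid)
        qh = self.q_proj(q).unsqueeze(1)        # (B, 1, hid)
        scores = self.att(torch.tanh(vh + qh))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)        # question-guided region weights
        v_att = (alpha * vh).sum(dim=1)         # (B, hid) attended visual vector
        fused = v_att * qh.squeeze(1)           # elementwise fusion of modalities
        return self.classifier(fused)           # logits over candidate answers

# Usage with random tensors standing in for real features
model = QuestionGuidedFusion()
v = torch.randn(2, 36, 2048)   # 36 detected regions per image
q = torch.randn(2, 1024)
logits = model(v, q)           # (2, 500) scores over candidate answers
```

The softmax over regions is what lets the question suppress redundant visual information: regions whose projected features align poorly with the question receive near-zero weight before fusion.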