Abstract:
Visual Question Answering (VQA) is the task of answering questions about an image. A VQA system combines computer vision and natural language processing to build a human-interactive system: it extracts visual features from the image, tokenizes and embeds the question, and finally fuses the two modalities to infer an answer.
Many previous techniques have used CNNs, RNNs, LSTMs, and GCNs to acquire visual content from the image. However, these approaches do not cater for information beyond the objects that are visually seen. Humans are capable of answering questions on the basis of common-sense knowledge. Inspired by this capability, knowledge bases that intelligent systems can process have been introduced. Acquiring external knowledge alongside a meaningful representation of the image has always been a challenging task in VQA. A wide variety of knowledge graphs have been created that contain facts, common-sense knowledge, identities of famous personalities, and so on; some of them are structured while others are unstructured. Using these knowledge graphs, this thesis presents a multimodal architecture that organizes the image and question data into a structured form
that not only detects the objects in the image but also captures the relationships and interactions between them. Knowledge graph integration further enhances the model's ability to exploit common-sense knowledge while answering the question. A question-guided attention mechanism then helps the model retain question-relevant information and discard any redundant information that may be retrieved. Lastly, a GRUC network uses gates to pass useful information from the scene and semantic graphs into the knowledge-driven graph and output a globally optimized answer. Our model outperforms state-of-the-art architectures on the FVQA dataset. A high-level overview of our model is shown with an example in Fig. 1 below.
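
To make the fusion and question-guided attention steps concrete, the following is a minimal sketch of how such a step could look in PyTorch; the module, its dimensions, the elementwise fusion, and all names are illustrative assumptions for exposition, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedFusion(nn.Module):
    """Toy sketch: attend over image region features using the question,
    then fuse the attended visual vector with the question embedding.
    All dimensions and layer choices are illustrative assumptions."""

    def __init__(self, v_dim=2048, q_dim=1024, hid=512, n_answers=500):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)   # project question embedding
        self.att = nn.Linear(hid, 1)          # scalar attention score per region
        self.classifier = nn.Linear(hid, n_answers)

    def forward(self, v, q):
        # v: (batch, n_regions, v_dim) region features, e.g. from a detector
        # q: (batch, q_dim) question embedding, e.g. the last LSTM state
        vh = self.v_proj(v)                     # (B, R, hid)
        qh = self.q_proj(q).unsqueeze(1)        # (B, 1, hid)
        scores = self.att(torch.tanh(vh + qh))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)        # question-guided region weights
        v_att = (alpha * vh).sum(dim=1)         # (B, hid) attended visual vector
        fused = v_att * qh.squeeze(1)           # elementwise fusion of modalities
        return self.classifier(fused)           # logits over candidate answers

# Usage with random tensors standing in for real features
model = QuestionGuidedFusion()
v = torch.randn(2, 36, 2048)   # 36 detected regions per image
q = torch.randn(2, 1024)
logits = model(v, q)           # (2, 500) scores over candidate answers
```

The softmax over regions is what lets the question suppress redundant visual information: regions whose projected features align poorly with the question receive near-zero weight before fusion.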