NUST Institutional Repository

A Multi-Modal Architecture for Visual Question Answering using Knowledge Graphs


dc.contributor.author Maqsood, Ifrah
dc.date.accessioned 2023-01-18T11:47:19Z
dc.date.available 2023-01-18T11:47:19Z
dc.date.issued 2022
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/32276
dc.description.abstract Visual Question Answering (VQA) is the task of answering questions about an image. A VQA system combines computer vision and natural language processing to build a human-interactive system. It involves extracting visual features, tokenizing and embedding the question, and finally fusing the two modalities to infer an answer. Many previous techniques used CNNs, RNNs, LSTMs, and GCNs to acquire visual content from the image. However, information beyond the objects that are visually present is not catered for. Humans are capable of answering questions using common-sense knowledge. Inspired by this capability, knowledge bases that intelligent systems can query have been introduced. Acquiring outside knowledge alongside a meaningful representation of the image has always been a challenging task in VQA. A wide variety of knowledge graphs have been created that contain facts, common-sense knowledge, identities of famous people, etc.; some are structured, others unstructured. Using these knowledge graphs, this thesis presents a multi-modal architecture that organizes the image and question data into a structural form that detects not only the objects in the image but also the relationships and interactions between them. The knowledge-graph integration further enhances the model's ability to apply common-sense knowledge while answering the question. A question-guided attention mechanism then helps the model keep question-relevant information and discard any redundant information that may be retrieved. Lastly, a GRUC network uses gates to propagate useful information from the scene and semantic graphs to the knowledge-driven graph to output a globally optimized answer. Our model outperforms state-of-the-art architectures on the FVQA dataset. The abstract-level details of our model are shown with an example below in 'Fig. 1'. en_US
dc.description.sponsorship Muhammad Moazam Fraz en_US
dc.language.iso en en_US
dc.publisher School of Electrical Engineering and Computer Sciences (SEECS) NUST en_US
dc.title A Multi-Modal Architecture for Visual Question Answering using Knowledge Graphs en_US
dc.type Thesis en_US
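The question-guided attention step described in the abstract can be illustrated with a minimal sketch. This assumes simple dot-product scoring between a question embedding and per-object visual features followed by a softmax; the thesis itself may use a learned scoring function, and all names here are hypothetical.

```python
import numpy as np

def question_guided_attention(object_feats, question_vec):
    """Weight object features by their relevance to the question.

    object_feats: (N, D) array of per-object visual features.
    question_vec: (D,) question embedding.
    Returns the attended feature vector (D,) and the attention weights (N,).
    """
    # Dot-product relevance score of each object feature to the question
    scores = object_feats @ question_vec              # shape (N,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over objects
    attended = weights @ object_feats                 # weighted sum, shape (D,)
    return attended, weights
```

Objects whose features align with the question receive higher weights, so question-irrelevant (redundant) visual information is down-weighted before fusion, as the abstract describes.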



This item appears in the following Collection(s)

  • MS [375]
