NUST Institutional Repository

Multimodal Knowledge Reasoning for Enhanced Visual Question Answering


dc.contributor.author Hussain, Afzaal
dc.date.accessioned 2022-06-28T07:45:12Z
dc.date.available 2022-06-28T07:45:12Z
dc.date.issued 2022
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/29763
dc.description.abstract Visual Question Answering (VQA) is a multidisciplinary task at the intersection of Natural Language Processing (NLP) and Computer Vision (CV), in which an image and a question are given to a VQA system that must find the most appropriate answer from the image; the question may be open-ended or multiple-choice. VQA systems serve a variety of real-world applications, such as providing situational information based on visual material, making judgments over large quantities of surveillance data, interacting with robots, and assisting individuals who are blind or visually impaired. The goal of this study is to improve the performance of VQA techniques and thereby increase the practical value of VQA. Fact-based VQA (FVQA) combines external knowledge with visual information in order to achieve more general VQA. Existing FVQA methods have the drawback of combining all types of data without fine-grained selection while reasoning toward the final answer, which introduces unexpected noise. The ability to capture question-oriented and information-complementary evidence is critical to solving this problem. In this thesis, we represent an image as a multi-modal heterogeneous graph with several information layers corresponding to the features of different graph modalities: semantic, visual, and factual. On the basis of these multi-layer graph representations, we introduce a modality-aware heterogeneous graph convolutional network that gathers evidence from the multiple information layers. In particular, intra-modal knowledge attention selects the most relevant information within a specific modality, whereas inter-modal (cross-modal) knowledge attention gathers relevant information across the different information modalities. Our approach performs multiple reasoning steps and predicts the best answer by assessing all of the information under the guidance of the question; by stacking this procedure several times, we achieve state-of-the-art results on the FVQA dataset, surpassing prior methods by more than 8% accuracy, which demonstrates the usefulness and interpretability of our approach. en_US
dc.description.sponsorship Dr. Muhammad Moazam Fraz en_US
dc.language.iso en en_US
dc.publisher SEECS-School of Electrical Engineering and Computer Science NUST Islamabad en_US
dc.title Multimodal Knowledge Reasoning for Enhanced Visual Question Answering en_US
dc.type Thesis en_US
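
The abstract above describes question-guided intra-modal attention within each graph layer and inter-modal (cross-modal) attention across the visual, semantic, and factual layers. The following is a minimal sketch of how such a two-stage, modality-aware attention step could be composed; it is an illustration under assumed conventions (module and variable names such as IntraModalAttention, CrossModalAttention, and the shared feature dimension are hypothetical), not the thesis's actual implementation.

# Minimal sketch (hypothetical names): question-guided attention within and
# across modality layers, assuming each layer is given as a dense node-feature
# matrix already projected to a shared dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalAttention(nn.Module):
    """Question-guided attention over the nodes of one modality layer."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, nodes, question):
        # nodes: (N, dim) node features; question: (dim,) question embedding
        q = question.unsqueeze(0).expand(nodes.size(0), -1)
        alpha = F.softmax(self.score(torch.cat([nodes, q], dim=-1)).squeeze(-1), dim=0)
        # Weighted sum of nodes -> one summary vector for this modality
        return (alpha.unsqueeze(-1) * nodes).sum(dim=0)

class CrossModalAttention(nn.Module):
    """Question-guided fusion of the per-modality summary vectors."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, summaries, question):
        # summaries: (M, dim), one summary per modality (visual/semantic/factual)
        q = question.unsqueeze(0).expand(summaries.size(0), -1)
        beta = F.softmax(self.score(torch.cat([summaries, q], dim=-1)).squeeze(-1), dim=0)
        # Fused evidence vector used downstream to score candidate answers
        return (beta.unsqueeze(-1) * summaries).sum(dim=0)

# Usage: three modality layers with different node counts, shared feature dim.
dim = 64
question = torch.randn(dim)
layers = [torch.randn(n, dim) for n in (36, 20, 50)]   # visual, semantic, factual
intra, cross = IntraModalAttention(dim), CrossModalAttention(dim)
summaries = torch.stack([intra(x, question) for x in layers])
evidence = cross(summaries, question)                   # (dim,) fused evidence

Stacking this intra-then-cross attention step several times, as the abstract describes, would correspond to repeatedly refining the node features and re-fusing the per-modality evidence under the same question guidance.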



This item appears in the following Collection(s)

  • MS [375]

