NUST Institutional Repository

Multimodal Knowledge Reasoning for Enhanced Visual Question Answering


dc.contributor.author Hussain, Afzaal
dc.date.accessioned 2022-06-28T07:45:12Z
dc.date.available 2022-06-28T07:45:12Z
dc.date.issued 2022
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/29763
dc.description.abstract Visual Question Answering (VQA) is a multidisciplinary task at the intersection of Natural Language Processing (NLP) and Computer Vision (CV), in which an image and a question are given to a VQA system that must find the most appropriate answer from the image; the question may be open-ended or multiple-choice. VQA systems serve a variety of real-world applications, such as providing situational information based on visual material, making judgments over large quantities of surveillance data, interacting with robots, and assisting individuals who are blind or visually impaired. The goal of this study is to improve the performance of VQA techniques and thereby increase the practical value of VQA. Fact-based VQA (FVQA) combines external knowledge with visual information in order to achieve more general VQA. Existing FVQA methods have the drawback of combining all types of data without fine-grained selection while reasoning toward the final answer, which introduces unexpected noise. The ability to capture question-oriented and information-complementary evidence is critical to solving this problem. In this thesis, we represent an image as a multi-modal heterogeneous graph with several information layers corresponding to the features of different graph modalities: semantic, visual, and factual. On the basis of these multi-layer graph representations, we introduce a modality-aware heterogeneous graph convolutional network that gathers evidence from the multiple information layers. In particular, intra-modal knowledge attention selects the most relevant information within a specific modality, whereas inter-modal (cross-modal) knowledge attention gathers relevant information across the different information modalities. Our approach performs multiple reasoning steps and predicts the best answer by assessing all of the information under the guidance of the question; by stacking this procedure several times, we achieve state-of-the-art results on the FVQA dataset, surpassing prior methods by more than 8% accuracy, which demonstrates the usefulness and interpretability of our approach. en_US
dc.description.sponsorship Dr. Muhammad Moazam Fraz en_US
dc.language.iso en en_US
dc.publisher SEECS-School of Electrical Engineering and Computer Science NUST Islamabad en_US
dc.title Multimodal Knowledge Reasoning for Enhanced Visual Question Answering en_US
dc.type Thesis en_US
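
The abstract above describes question-guided intra-modal attention within each graph layer and inter-modal (cross-modal) attention across the visual, semantic, and factual layers. The following is a minimal sketch of how such a two-stage, modality-aware attention step could be composed; it is an illustration under assumed conventions (module and variable names such as IntraModalAttention, CrossModalAttention, and the shared feature dimension are hypothetical), not the thesis's actual implementation.

# Minimal sketch (hypothetical names): question-guided attention within and
# across modality layers, assuming each layer is given as a dense node-feature
# matrix already projected to a shared dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalAttention(nn.Module):
    """Question-guided attention over the nodes of one modality layer."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, nodes, question):
        # nodes: (N, dim) node features; question: (dim,) question embedding
        q = question.unsqueeze(0).expand(nodes.size(0), -1)
        alpha = F.softmax(self.score(torch.cat([nodes, q], dim=-1)).squeeze(-1), dim=0)
        # Weighted sum of nodes -> one summary vector for this modality
        return (alpha.unsqueeze(-1) * nodes).sum(dim=0)

class CrossModalAttention(nn.Module):
    """Question-guided fusion of the per-modality summary vectors."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, summaries, question):
        # summaries: (M, dim), one summary per modality (visual/semantic/factual)
        q = question.unsqueeze(0).expand(summaries.size(0), -1)
        beta = F.softmax(self.score(torch.cat([summaries, q], dim=-1)).squeeze(-1), dim=0)
        # Fused evidence vector used downstream to score candidate answers
        return (beta.unsqueeze(-1) * summaries).sum(dim=0)

# Usage: three modality layers with different node counts, shared feature dim.
dim = 64
question = torch.randn(dim)
layers = [torch.randn(n, dim) for n in (36, 20, 50)]   # visual, semantic, factual
intra, cross = IntraModalAttention(dim), CrossModalAttention(dim)
summaries = torch.stack([intra(x, question) for x in layers])
evidence = cross(summaries, question)                   # (dim,) fused evidence

Stacking this intra-then-cross attention step several times, as the abstract describes, would correspond to repeatedly refining the node features and re-fusing the per-modality evidence under the same question guidance.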



This item appears in the following Collection(s)

  • MS [375]

