dc.description.abstract |
Visual Question Answering (VQA) is a multidisciplinary task at the intersection of Natural
Language Processing (NLP) and Computer Vision (CV), in which an image and a question are given to a VQA system that is responsible for finding the most appropriate answer
based on the image; the given question can be open-ended or multiple-choice. The VQA system is
used for a variety of real-world applications, such as providing situational information
based on visual material, making judgments using a vast quantity of surveillance data,
interacting with robots, and helping individuals who are blind or visually impaired.
The goal of this study is to improve the performance of VQA techniques in order to enhance
the practical importance of VQA in real life. Fact-based VQA (FVQA) combines external knowledge
with visual information in order to accomplish more general VQA. Existing FVQA methods
have the drawback of combining all types of information without fine-grained selection
while reasoning about the final answer, which introduces unexpected noise. The capability to
capture question-oriented and information-complementary evidence is critical to solving this problem. In this thesis, we represent an image as a multi-modal heterogeneous
graph with multiple information layers corresponding to the features of different graph
modalities: semantic, visual, and factual. On the basis of these multi-layer graph representations,
we introduce a modality-aware heterogeneous graph convolutional network to capture
information from the multiple information layers. In particular, intra-modal knowledge attention fetches the most relevant information from a specific modality, whereas
inter-modal (cross-modal) knowledge attention fetches relevant information across the different modalities.
Our approach performs multiple reasoning steps and predicts the best answer by assessing all of the information guided by the question. By stacking
this procedure several times, we achieve state-of-the-art results on the FVQA dataset,
surpassing competing methods by more than 8% in accuracy, which demonstrates the usefulness and interpretability of our approach. |
en_US |