NUST Institutional Repository

Enhancing Multimodal Information Extraction from Visually Rich Documents

dc.contributor.author Aresha, Arshad
dc.date.accessioned 2025-01-07T06:16:26Z
dc.date.available 2025-01-07T06:16:26Z
dc.date.issued 2024
dc.identifier.issn 399763
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/48833
dc.description Supervisor: Dr. Momina Moetesum; Co-Supervisor: Dr. Faisal Shafait en_US
dc.description.abstract Understanding Visually Rich Documents (VRDs) such as forms, infographics, emails, and invoices presents significant challenges due to their complex layouts, multi-line entities, and diverse content structures. These challenges impact various applications, including data extraction, compliance checking, legal file interpretation, contract verification, accounting statement analysis, medical case recognition, and educational assessments. While transformer-based models have shown promising results in VRD tasks, they often struggle to capture hierarchical relationships and fine-grained inter-entity dependencies. This research proposes an enhanced multimodal architecture that integrates Graph Neural Networks (GNNs), 2D positional embeddings, and a multi-stage transformer framework to effectively capture spatial relationships and improve multi-line entity recognition. By incorporating attention mechanisms and 2D embeddings, the methodology facilitates the joint analysis of textual and spatial information, enabling accurate extraction of multimodal information from complex document layouts. Experimental evaluation on benchmark datasets, including FUNSD, demonstrates the effectiveness of the proposed approach. The model achieved an F1 score of 91.35% on FUNSD, surpassing existing state-of-the-art methods. These results validate that combining GNNs with a multi-stage transformer and 2D positional embeddings significantly enhances the spatial comprehension and hierarchical relationship modeling necessary for VRD tasks. This research advances the field of Visually Rich Document Understanding by providing a robust methodology for automating the extraction and interpretation of multimodal information from visually complex documents. en_US
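
To make the role of the 2D positional embeddings mentioned in the abstract concrete, the sketch below shows one common, LayoutLM-style formulation in which each token's bounding box is quantized onto a page grid and learned x/y coordinate embeddings are added to the word embedding before the transformer stage. This is a minimal illustration under that assumption, not the thesis implementation; the class name, grid size, and hidden size are all hypothetical.

    # Minimal sketch (not the thesis code) of LayoutLM-style 2D positional
    # embeddings: bounding-box coordinates on a 0..1023 page grid index into
    # learned x/y lookup tables, whose outputs are summed into the token
    # embedding. All names and sizes here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TwoDPositionalEmbedding(nn.Module):
        def __init__(self, hidden_size: int = 768, grid_size: int = 1024):
            super().__init__()
            # Separate lookup tables for the x and y axes of the page grid.
            self.x_emb = nn.Embedding(grid_size, hidden_size)
            self.y_emb = nn.Embedding(grid_size, hidden_size)

        def forward(self, token_emb: torch.Tensor,
                    bboxes: torch.Tensor) -> torch.Tensor:
            # token_emb: (batch, seq_len, hidden)
            # bboxes:    (batch, seq_len, 4) integer (x0, y0, x1, y1) coords.
            x0, y0, x1, y1 = bboxes.unbind(dim=-1)
            spatial = (self.x_emb(x0) + self.y_emb(y0)
                       + self.x_emb(x1) + self.y_emb(y1))
            return token_emb + spatial

    # Usage: inject layout information into word embeddings.
    emb = TwoDPositionalEmbedding()
    tokens = torch.randn(2, 16, 768)            # dummy word embeddings
    boxes = torch.randint(0, 1024, (2, 16, 4))  # dummy quantized boxes
    out = emb(tokens, boxes)                    # (2, 16, 768)

A GNN stage over the same boxes would then typically connect spatially adjacent boxes (for example, k-nearest neighbours by box centre) so that message passing can relate the lines of a multi-line entity, which is the spatial-relationship modelling the abstract describes.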
dc.language.iso en en_US
dc.publisher SEECS NUST en_US
dc.subject Visually rich document understanding, Multimodal transformer, Positional embeddings, Document AI, Explainable AI. en_US
dc.title Enhancing Multimodal Information Extraction from Visually Rich Documents en_US
dc.type Thesis en_US

