NUST Institutional Repository

Enhancing Multimodal Information Extraction from Visually Rich Documents

dc.contributor.author Aresha, Arshad
dc.date.accessioned 2025-01-07T06:16:26Z
dc.date.available 2025-01-07T06:16:26Z
dc.date.issued 2024
dc.identifier.issn 399763
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/48833
dc.description Supervisor: Dr. Momina Moetesum; Co-Supervisor: Dr. Faisal Shafait en_US
dc.description.abstract Understanding Visually Rich Documents (VRDs) such as forms, infographics, emails, and invoices presents significant challenges due to their complex layouts, multi-line entities, and diverse content structures. These challenges impact various applications, including data extraction, compliance checking, legal file interpretation, contract verification, accounting statement analysis, medical case recognition, and educational assessments. While transformer-based models have shown promising results in VRD tasks, they often struggle to capture hierarchical relationships and fine-grained inter-entity dependencies. This research proposes an enhanced multimodal architecture that integrates Graph Neural Networks (GNNs), 2D positional embeddings, and a multi-stage transformer framework to effectively capture spatial relationships and improve multi-line entity recognition. By incorporating attention mechanisms and 2D embeddings, the methodology facilitates the joint analysis of textual and spatial information, enabling accurate extraction of multimodal information from complex document layouts. Experimental evaluation on benchmark datasets, including FUNSD, demonstrates the effectiveness of the proposed approach. The model achieved an F1 score of 91.35% on FUNSD, surpassing existing state-of-the-art methods. These results validate that combining GNNs with a multi-stage transformer and 2D positional embeddings significantly enhances the spatial comprehension and hierarchical relationship modeling necessary for VRD tasks. This research advances the field of Visually Rich Document Understanding by providing a robust methodology for automating the extraction and interpretation of multimodal information from visually complex documents. en_US
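
To make the role of the 2D positional embeddings mentioned in the abstract concrete, the sketch below shows one common, LayoutLM-style formulation in which each token's bounding box is quantized onto a page grid and learned x/y coordinate embeddings are added to the word embedding before the transformer stage. This is a minimal illustration under that assumption, not the thesis implementation; the class name, grid size, and hidden size are all hypothetical.

    # Minimal sketch (not the thesis code) of LayoutLM-style 2D positional
    # embeddings: bounding-box coordinates on a 0..1023 page grid index into
    # learned x/y lookup tables, whose outputs are summed into the token
    # embedding. All names and sizes here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class TwoDPositionalEmbedding(nn.Module):
        def __init__(self, hidden_size: int = 768, grid_size: int = 1024):
            super().__init__()
            # Separate lookup tables for the x and y axes of the page grid.
            self.x_emb = nn.Embedding(grid_size, hidden_size)
            self.y_emb = nn.Embedding(grid_size, hidden_size)

        def forward(self, token_emb: torch.Tensor,
                    bboxes: torch.Tensor) -> torch.Tensor:
            # token_emb: (batch, seq_len, hidden)
            # bboxes:    (batch, seq_len, 4) integer (x0, y0, x1, y1) coords.
            x0, y0, x1, y1 = bboxes.unbind(dim=-1)
            spatial = (self.x_emb(x0) + self.y_emb(y0)
                       + self.x_emb(x1) + self.y_emb(y1))
            return token_emb + spatial

    # Usage: inject layout information into word embeddings.
    emb = TwoDPositionalEmbedding()
    tokens = torch.randn(2, 16, 768)            # dummy word embeddings
    boxes = torch.randint(0, 1024, (2, 16, 4))  # dummy quantized boxes
    out = emb(tokens, boxes)                    # (2, 16, 768)

A GNN stage over the same boxes would then typically connect spatially adjacent boxes (for example, k-nearest neighbours by box centre) so that message passing can relate the lines of a multi-line entity, which is the spatial-relationship modelling the abstract describes.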
dc.language.iso en en_US
dc.publisher SEECS NUST en_US
dc.subject Visually rich document understanding, Multimodal transformer, Positional embeddings, Document AI, Explainable AI. en_US
dc.title Enhancing Multimodal Information Extraction from Visually Rich Documents en_US
dc.type Thesis en_US

