dc.description.abstract |
Understanding Visually Rich Documents (VRDs), such as forms, infographics, emails,
and invoices, presents significant challenges due to their complex layouts, multi-line
entities, and diverse content structures. These challenges impact various applications,
including data extraction, compliance checking, legal file interpretation, contract verification, accounting statement analysis, medical case recognition, and educational assessments. While transformer-based models have shown promising results in VRD tasks,
they often struggle to capture hierarchical relationships and fine-grained inter-entity
dependencies. This research proposes an enhanced multimodal architecture that integrates Graph Neural Networks (GNNs), 2D positional embeddings, and a multi-stage
transformer framework to capture spatial relationships and effectively improve multi-line
entity recognition. By incorporating attention mechanisms and 2D embeddings,
the methodology facilitates the joint analysis of textual and spatial information, enabling accurate extraction of multimodal information from complex document layouts.
Experimental evaluation on benchmark datasets, including FUNSD, demonstrates the
effectiveness of the proposed approach. The model achieved an F1 score of 91.35%
on FUNSD, surpassing existing state-of-the-art methods. These results validate that
combining GNNs with a multi-stage transformer and 2D positional embeddings significantly enhances the spatial comprehension and hierarchical relationship modeling
necessary for VRD tasks. This research advances the field of Visually Rich Document
Understanding by providing a robust methodology for automating the extraction and
interpretation of multimodal information from visually complex documents. |
en_US |