ExtractID: Deep Information Extraction from Identity Documents

ZAIN UL ABIDIN, SYED

URI: http://10.250.8.41:8080/xmlui/handle/123456789/44539

Date: 2024

Abstract:

Efficient extraction of information from identity documents, such as Computerized National Identity Cards (CNICs), is a pivotal part of modern document analysis and information retrieval systems. Traditional Optical Character Recognition (OCR) techniques often fall short under real-world conditions, including blurred images, varying illumination, and complex backgrounds. This thesis presents an approach that combines the OCR-free algorithm known as "Donut" with pre-processing and optimization techniques to improve the accuracy and robustness of information extraction. The study begins with a localization task that uses YOLOv5 for text detection, coupled with OCR-based recognition and extraction using Tesseract. Recognizing the limitations of OCR techniques, the research then moves to the OCR-free approach and prepares a self-annotated dataset of CNICs encoded in JSON Lines text format. The proposed methodology applies dataset pre-processing and augmentation for training, comprising random crop, random rotation, random brightness-contrast adjustment, and Gaussian noise injection. The Donut model configuration is detailed, and the model is optimized for memory use, with emphasis on its ability to handle blurred, dark, bright, and noisy images. With the proposed pipeline, the model reaches an accuracy of 99.96% and an F1 score of 99.46% on test data, indicating robust performance in real-world conditions. In addition, a dataset of HTML bio-data forms is prepared and a Donut model is trained on it with the same pipeline, showing consistent performance on test data. To facilitate practical use, a Django API is developed for testing images, demonstrating the model’s effectiveness in real-time applications.
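As an illustration of the JSON Lines encoding mentioned above, the following is a minimal sketch of how one annotated card image might be stored as one record per line. The abstract does not give the schema, so the layout here is an assumption: it follows the `file_name` / `ground_truth` / `gt_parse` convention commonly used for Donut training data. The field names `name` and `identity_number`, the file names, and the placeholder values are hypothetical.

```python
import json


def to_metadata_line(file_name, fields):
    """Serialise one annotated image as a single JSON Lines record.

    Assumed layout: the target fields are wrapped in "gt_parse" and stored
    as a JSON string under "ground_truth".
    """
    ground_truth = json.dumps({"gt_parse": fields}, ensure_ascii=False)
    record = {"file_name": file_name, "ground_truth": ground_truth}
    return json.dumps(record, ensure_ascii=False)


def parse_metadata_line(line):
    """Recover the file name and the field dictionary from one record."""
    record = json.loads(line)
    fields = json.loads(record["ground_truth"])["gt_parse"]
    return record["file_name"], fields


# Hypothetical annotations with placeholder values (no real identity data).
samples = [
    ("cnic_0001.jpg", {"name": "EXAMPLE NAME", "identity_number": "00000-0000000-0"}),
    ("cnic_0002.jpg", {"name": "ANOTHER NAME", "identity_number": "11111-1111111-1"}),
]

lines = [to_metadata_line(file_name, fields) for file_name, fields in samples]
metadata_jsonl = "\n".join(lines) + "\n"  # one JSON object per line
```

Keeping one self-contained JSON object per line means a record can be appended or checked without parsing the whole file.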
The findings of this research underscore the significance of OCR-free approaches, specifically the Donut algorithm, in overcoming the limitations of traditional OCR techniques. The outcomes confirm the model’s exceptional performance in information extraction tasks related to ID cards, laying the foundation for advancements in document analysis, identity verification, and broader applications in the field of information retrieval.
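The four augmentation steps named in the abstract (random crop, random rotation, random brightness-contrast adjustment, and Gaussian noise injection) can be sketched as follows. This is not the thesis implementation: the functions are NumPy-only stand-ins, and every parameter (`keep`, `max_deg`, `limit`, `sigma`) and the order of the steps are assumptions chosen for illustration.

```python
import numpy as np


def random_crop(img, rng, keep=0.9):
    """Crop a random window covering a `keep` fraction of each side."""
    h, w = img.shape[:2]
    ch, cw = max(1, int(round(h * keep))), max(1, int(round(w * keep)))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return img[top:top + ch, left:left + cw]


def random_rotate(img, rng, max_deg=5.0):
    """Rotate about the centre by a random small angle (nearest neighbour)."""
    h, w = img.shape[:2]
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Inverse mapping: for every output pixel, look up its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)  # pixels that map outside the image stay black
    out[valid] = img[sy[valid], sx[valid]]
    return out


def random_brightness_contrast(img, rng, limit=0.2):
    """Apply a random contrast gain and brightness offset."""
    alpha = 1.0 + rng.uniform(-limit, limit)    # contrast gain
    beta = 255.0 * rng.uniform(-limit, limit)   # brightness offset
    out = img.astype(np.float32) * alpha + beta
    return np.clip(out, 0, 255).astype(np.uint8)


def gaussian_noise(img, rng, sigma=10.0):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)


def augment(img, seed=None):
    """Run the four steps in sequence on a uint8 image array."""
    rng = np.random.default_rng(seed)
    for step in (random_crop, random_rotate,
                 random_brightness_contrast, gaussian_noise):
        img = step(img, rng)
    return img


# Usage on a synthetic grey image standing in for a scanned card.
card = np.full((100, 160, 3), 128, dtype=np.uint8)
augmented = augment(card, seed=0)
```

Passing a fixed `seed` makes an augmented sample reproducible, which helps when inspecting how a degraded image (dark, bright, or noisy) affects extraction.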