Sindhi Ligature Recognition in Printed Text: A Large Scale Font Diverse Sindhi Ligature Recognition System

Ali, Zeeshan

URI: http://10.250.8.41:8080/xmlui/handle/123456789/35440

Date: 2023

Abstract:

The advent of Deep Learning in Computer Vision has driven advances across a diverse set of fields. Optical Character Recognition (OCR) plays a vital role in the modern age of Artificial Intelligence, yet it remains a challenging task that is difficult to implement and computationally expensive. Sindhi is a literature-rich language spoken by millions of people around the globe, and it has a wealth of preserved grammatical forms. OCR systems for English have seen significant development, but little work has been done on Arabic script, and most Sindhi literature uses the extended Perso-Arabic script. To the best of our knowledge, no benchmark datasets for Sindhi OCR have been published. Consequently, no state-of-the-art Sindhi OCR models have been devised. This thesis attempts to fill this research gap by making the following contributions. First, we have extracted a set of 22,597 ligatures found in Sindhi literature. Second, we present a synthesized benchmark dataset for Sindhi printed text recognition at the ligature level. The dataset is font diverse, comprising 256 unique fonts. Finally, we have set up a baseline neural network for Sindhi ligature recognition in printed text, which achieves 91.85% test accuracy on the benchmark dataset. Our baseline can be used to build the complete pipeline of a font-invariant Sindhi OCR system.