Abstract:
The advent of Deep Learning in Computer Vision has driven advances across a diverse
set of fields. Optical Character Recognition (OCR) plays
a vital role in the modern age of Artificial Intelligence. It is a challenging task, difficult to
implement, and computationally expensive. Sindhi is a literature-rich language spoken
by millions of people around the globe. It has an abundance of preserved grammatical
forms. There has been significant development in OCR systems for English, but comparatively
little work has been done on Arabic script. Most Sindhi literature uses the extended
Perso-Arabic script. To the best of our knowledge, no benchmark datasets for Sindhi OCR
have been published. Consequently, no state-of-the-art Sindhi OCR models have been devised.
This thesis attempts to fill this research gap through the following contributions. We
have extracted a set of 22,597 ligatures found in Sindhi literature. We present
a synthesized benchmark dataset for Sindhi printed text recognition at the ligature level.
The dataset is font-diverse, comprising 256 unique fonts. Finally, we have set up a
baseline neural network for Sindhi Ligature Recognition in printed text. It has achieved
91.85% test accuracy on the benchmark dataset. Our baseline can serve as the foundation
for a complete, font-invariant Sindhi OCR pipeline.