Abstract:
Existing commercial software such as ABBYY and the Google Vision API provides support for Arabic and Urdu text, but accuracy remains low because of the writing style of these non-Latin scripts. Research on optical character recognition (OCR) for Urdu dates back to 2003, when a system could recognise only isolated Urdu characters; interest in the problem has since grown with the increase in data and digitisation. A further motivation is the explosion of Urdu data in the financial and economic sectors, including printed and handwritten scanned documents. OCR for Urdu is challenging because Urdu is a non-Latin script with a cursive writing style, and the resulting challenges must be addressed in different phases of the OCR system. This thesis presents an Urdu OCR system, together with a data generation and encoding technique that helps standardise data for optical character recognition. The proposed model is a four-stage network: the first stage normalises the image, the second and third stages perform feature extraction and sequence generation, and the final stage predicts the digitised text present in the image. The proposed algorithm is compared against baseline implementations of widely adopted supervised deep learning methods.