Abstract:
Text recognition from images has received considerable attention from researchers for
over three decades and is one of the most widely studied problems in the domain of pattern
classification. Optical character recognition (OCR) systems can be used to recognize
digits, letters, words and sentences in any language, and can be applied to scene text
recognition to assist visually impaired persons. These systems can also be used for the
assisted navigation of autonomous vehicles. Despite all the related work that has been done
on Latin scripts, the recognition of text in languages written in non-Latin scripts, such as
Urdu, Arabic and Pashto, remains a challenging task because of the complex cursive nature of
these scripts.
Urdu OCR systems can be used to digitize data, which in turn makes it available for the
search and retrieval of specific information. Such systems give easy access to
content-based information retrieval. Extracting embedded text from images is also an
active area of research in the document analysis community. Currently available OCR
frameworks focus mainly on the recognition of Latin-script text such as English and cannot
be applied to non-Latin languages. We therefore aim to implement a solution based on a deep
learning framework for the line-level recognition of Urdu text, in order to extract useful
information from news tickers.
In this work, we design an end-to-end system that detects and recognizes Urdu text
embedded in TV channel streams, which is commonly written in the Nasta‘liq style. The
development of an Urdu OCR system consists mainly of two subtasks: text detection and text
recognition. For the development of robust recognition systems, access to a large quantity
of annotated data is the first and foremost requirement. The dataset used here has
therefore been collected from different news channels and is comprehensive enough to cover
both low- and high-resolution images. It includes distorted, low-quality and faded news
tickers, making it well suited for testing the performance of any Urdu news OCR system.
Once the text has been detected or localized in an image, it can be cropped and passed to
the recognition stage. For the recognition task, a
language-independent, end-to-end architecture based on a Convolutional Recurrent Neural
Network (CRNN) with a Connectionist Temporal Classification (CTC) loss function is proposed
for the line-level recognition of Urdu text embedded in the news tickers of TV channel
streams. In the proposed system, a large number of different techniques are used for data
augmentation. These data variations help prevent the model from overfitting and help it
generalize better. Finally, the results of this approach are reported on the test set: a
character error rate (CER) of 0.63%, a word error rate (WER) of 6.43%, an LER of 5.14% and
a Levenshtein distance of 0.02 on the Urdu Ticker
Text dataset. These results indicate that the proposed methodology shows outstanding
performance compared with commercially available recognition systems and that it can be
applied to a variety of other non-Latin scripts as well. In this thesis, we also discuss
the common problems faced when building recognizers for low-resource languages such as
Urdu and Arabic. The outcomes of this study are expected to be useful for researchers
working on the recognition of non-Latin languages written in cursive scripts.
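
The error rates quoted above are edit-distance based metrics. The abstract does not state how they were computed, so the following is only a minimal sketch under a common convention: the total Levenshtein distance between reference and predicted lines, divided by the total reference length, counted in characters for CER and in whitespace-separated words for WER. The function names (levenshtein, error_rate, cer, wer) are hypothetical and are not taken from the thesis.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Corpus-level error rate: total edits divided by total reference length.

    Assumption: errors are pooled over all lines rather than averaged per line.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref), tokenize(hyp)
        total_edits += levenshtein(r, h)
        total_len += len(r)
    return total_edits / total_len if total_len else 0.0


def cer(refs, hyps):
    """Character error rate: each line is tokenized into characters."""
    return error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word error rate: each line is tokenized on whitespace."""
    return error_rate(refs, hyps, str.split)
```

For example, cer(["abcd"], ["abcf"]) evaluates to 0.25, since one substitution is needed over four reference characters. For Urdu text, results depend on how the strings are normalized before comparison (for instance, the handling of diacritics and zero-width joiners), which this sketch leaves to the caller.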