Abstract:
Optical character recognition (OCR) is an important research area in pattern recognition. OCR converts a digital image obtained from a scanner into a computer-editable format. It is important not only for digitizing literature but also because of its growing use in banking, business and data applications. Because of this importance, many languages, e.g. English, Chinese and Latin, have well-developed OCR systems. Nastaliq is the most famous calligraphic script for the Urdu language, but very little work is available on Nastaliq optical character recognition, mainly because of the complexity of the script. A complete OCR system for any language must be character based, i.e. segmentation based. Existing work on Urdu OCR addresses either isolated characters or segmentation-free identification of ligatures.
The presented technique proposes a Nastaliq OCR system based on segmenting ligatures into individual characters. Structural features of the initial, medial, final and isolated forms of Urdu characters are identified, and characters are recognized and segmented on the basis of these features. Unlike existing approaches, the proposed technique performs recognition and segmentation side by side. For isolated, initial and final characters, the base shape is identified first using structural features and then passed through a special segmentation procedure; finally, the exact character is identified with the help of dots. Medial characters are segmented first, based on the position and the presence or absence of dots, and then identified.
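As a minimal illustration of the last step for isolated, initial and final characters, where dots resolve a base shape to an exact character, the following Python sketch uses the bay family of Urdu letters. The table, the function name `resolve_character` and the (count, position) encoding of dots are assumptions made for illustration only; they are not the features or data structures of the presented system.

```python
# Hypothetical lookup: (base shape, dot count, dot position) -> Urdu letter.
# Only the "bay" family is listed; the key encoding is an illustrative assumption.
DOT_TABLE = {
    ("bay", 1, "below"): "ب",
    ("bay", 3, "below"): "پ",
    ("bay", 2, "above"): "ت",
    ("bay", 3, "above"): "ث",
}


def resolve_character(base_shape, dot_count, dot_position):
    """Return the exact letter for a base shape and its dots, or None if unknown."""
    return DOT_TABLE.get((base_shape, dot_count, dot_position))
```

Under these assumptions, a base shape recognized as bay with two dots above it resolves to ت, while a combination missing from the table yields `None` and could be flagged for further inspection.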
The proposed technique offers a novel method for segmenting Nastaliq script into individual characters and for recognizing characters instead of ligatures. Because a segmentation-based approach is used, it removes training time and reduces the overhead of extracting features from whole ligatures.
Results show an accuracy of 84%, which can be increased further by improving feature selection and by resolving the thinning, dot thinning and dot overlapping problems.