Abstract:
Arabic-like scripts (e.g., Urdu Nastaleeq) are cursive in nature even in their printed
form. Unconstrained handwritten text in such scripts poses a serious challenge for
deep-learning-based text recognition systems, as diverse writing styles make training
difficult. Successful training of deep learning algorithms depends
heavily on the availability of a large amount of training data. Since neural networks
operate in a supervised learning manner, transcription of text for training purposes
becomes a very laborious and time-consuming task. This paper presents the first
comprehensive unconstrained Urdu handwriting dataset and a recognition system
based on a Long Short-Term Memory (LSTM) architecture. The dataset has been
developed from a corpus more than 21 million in size, covering 7 domains of the Urdu
language, and contains around 30 thousand unique ligatures. To evaluate the performance of
deep learning algorithms, several LSTM network architectures were explored,
and multidimensional LSTM (MDLSTM) was found to perform best for offline Urdu handwriting recognition.
The model evaluated on test data yields 73.1% accuracy for character recognition.