Abstract:
Writing is a codified system of standard symbols: the repetition of agreed-upon simple shapes to represent ideas. Symbol-based language is assumed to be universal, easy to interpret and efficient to use. Handwriting remains one of the most frequently occurring patterns in everyday life, and it offers a number of interesting pattern classification problems, including handwriting recognition, writer identification, signature verification, writer demographics classification and script recognition. There is a pressing need to address these problems and to devise a script-independent framework that can be applied globally to exploit the wealth of knowledge contained in handwritten documents. Considerable research in this area is ongoing. The work presented here is a document indexing and retrieval system that uses word spotting as the matching technique. Word spotting is an attractive alternative to traditional Optical Character Recognition (OCR) systems: instead of converting the image into text, retrieval is based on matching the images of words using pattern classification techniques. The proposed system extracts words from images of handwritten documents and converts each word into a shape represented by its contour. Converting words into shapes is an innovation of our framework that, to the best of our knowledge, has not previously been explored in word spotting and opens new avenues of research. A set of multiple features is then extracted from each shaped word, and instances of the same word are grouped into clusters. These clusters are used to train a multi-class Support Vector Machine (SVM), which learns the different word classes. The documents to be indexed are segmented into words, and the closest cluster for each word is determined using the SVM.
An index file is maintained for each word cluster; it records the documents containing the respective word along with the word's locations within each document. A query word presented to the system is matched against the clusters in the database, and the documents containing occurrences of the query word are presented to the user. Evaluated on handwritten images from the IAM database, the system achieved promising precision and recall rates. Enriching the feature vector space with a new set of features is also a major contribution. A study has been carried out to analyze the contribution and significance of the different features employed. Principal Component Analysis (PCA) is applied to retain the most relevant features and reduce the dimensionality of the feature vector. The proposed framework has also been successfully tested on Urdu, an extremely challenging cursive script. Promising results on both English and Urdu scripts demonstrate the script independence of the framework and its potential for global application.
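
The per-cluster index described above can be illustrated with a minimal sketch. It assumes that the SVM has already assigned a cluster label to every segmented word; the class name `WordIndex`, the function `build_index`, the document names and the bounding-box format (x, y, width, height) are hypothetical choices made for illustration and are not taken from the system itself.

```python
from collections import defaultdict


class WordIndex:
    """Per-cluster index: cluster label -> {document id -> [word locations]}."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(list))

    def add(self, cluster, doc_id, location):
        # Record one occurrence of a word cluster in a document.
        self._index[cluster][doc_id].append(location)

    def lookup(self, cluster):
        # Return the documents (and locations) for a cluster label.
        # An unknown cluster yields an empty result and is not added to the index.
        if cluster not in self._index:
            return {}
        return {doc: list(locs) for doc, locs in self._index[cluster].items()}


def build_index(classified_words):
    """Build the index from (cluster, doc_id, location) triples.

    The triples stand in for the output of the classification stage,
    which is assumed here and not implemented.
    """
    index = WordIndex()
    for cluster, doc_id, location in classified_words:
        index.add(cluster, doc_id, location)
    return index


# Hypothetical classifier output; locations are (x, y, width, height) boxes.
sample = [
    ("the", "doc1.png", (10, 20, 40, 18)),
    ("word", "doc1.png", (60, 20, 55, 18)),
    ("the", "doc2.png", (15, 80, 42, 19)),
    ("the", "doc1.png", (200, 50, 41, 18)),
]

index = build_index(sample)
hits = index.lookup("the")  # documents and positions for the query cluster
```

At query time, the query word image would first be classified into a cluster by the same SVM, and the resulting label passed to `lookup`; keeping locations per document allows each occurrence to be highlighted in the retrieved page.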