dc.description.abstract |
This project builds a Multilingual Information Retrieval System. The objective was a search engine that searches the internet for Urdu/Arabic (multilingual) information and shows the results to the user. The project was implemented in two parts, with the modules divided between them as follows:
The first part consists of building Optical Character Recognition (OCR) software that takes images of text as input and produces raw Unicode characters as output. Artificial Neural Networks identify the text in the images and assign the corresponding Unicode character to each identified character.
The second part covers: a user interface for entering a search criterion (an Urdu/Arabic query) and a URL (the starting point for the search); a web crawler that searches HTML web pages for images containing Urdu/Arabic text; an XML document (serving as the database) that structures the unstructured multimedia data (images) the crawler collects from the internet; digital image processing to filter and standardize the images; storage in XML documents of the Urdu/Arabic text (Unicode characters) that the OCR extracts from the images; comparison of the retrieved text against the user-supplied query using query comparison algorithms; ranking of the results; and display of the results to the user.
This report covers the second part of the Urdu/Arabic Information Retrieval System, titled "Urdu/Arabic OCR – Image and Text Handling". |
en_US |