Abstract:
This research explores the strength and usefulness of text mining for extracting and identifying crime news in online Urdu news headlines. News classification is an important application area of text mining. News classification systems are language specific and have been designed for languages such as English, Indonesian, and Indian languages. A significant amount of research on news classification has been done in these languages over the years, but very little has been done so far for the Urdu language. Text classification in Urdu has primarily been performed on a large text corpus collected from different online Urdu news websites. Meanwhile, users access a huge amount of news data online, which provides news and useful information across various categories and domains of interest. Various news classification systems have been built to organize and manage this data efficiently and to derive applications that benefit users at large.
This research focuses on the classification of online crime news headlines in the Urdu language. The data gathered for this work consists of online news headlines in Urdu, collected in text format from the Urdu-language websites of three newspapers, namely Nawa-i-Waqt, Jang, and BBC Urdu. A classification system is built to classify crime news; it performs text preprocessing, further analysis of the structured text, and then applies a text classification technique. A dictionary of Urdu crime words is compiled, containing the words related to the crime domain. This dictionary is part of the classification system and significantly assists the process of classifying crime news. The classifier is trained on two different datasets and the results are analyzed. For testing, the classifier is given 500 news headlines in Urdu. The classifier correctly classifies 96% of the crime news. This classification system can be used to access different online Urdu news websites and display the crime news out of all the news present on a website. A performance evaluation of the classification system is also carried out at the end as an essential part of this research, with a view to achieving better results.
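To illustrate how a crime words dictionary could assist such a system, the following is a minimal Python sketch of a dictionary lookup over preprocessed headlines. It is not the classifier built in this research: the word list `CRIME_WORDS`, the punctuation set, the function names, and the sample headlines are all hypothetical and chosen only for illustration.

```python
# Hypothetical mini-dictionary of crime-related Urdu words (illustrative only;
# the dictionary compiled in this research is far larger).
CRIME_WORDS = {"قتل", "چوری", "ڈکیتی", "اغوا"}

# Assumed punctuation set: common Urdu and Latin marks to strip before matching.
PUNCTUATION = "۔،؟!.,:;()"
_STRIP_TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def tokenize(headline: str) -> list[str]:
    """Replace punctuation with spaces, then split on whitespace."""
    return headline.translate(_STRIP_TABLE).split()


def is_crime_headline(headline: str) -> bool:
    """Return True if any token of the headline appears in the dictionary."""
    return any(token in CRIME_WORDS for token in tokenize(headline))


if __name__ == "__main__":
    samples = [
        "لاہور میں ڈکیتی کی واردات",  # "Robbery incident in Lahore"
        "کراچی میں بارش کا امکان",    # "Chance of rain in Karachi"
    ]
    for headline in samples:
        print(headline, is_crime_headline(headline))
```

A lookup of this kind only flags headlines that contain a listed word; in a full system it would more plausibly supply features to a trained classifier than replace one.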