DOCUMENT TOPIC GENERATION IN TEXT MINING BY USING CLUSTER ANALYSIS WITH ENHANCED ROCK (EROCK)

AHMAD, RIZWAN

URI: http://10.250.8.41:8080/xmlui/handle/123456789/37541

Date: 2010

Abstract:

Clustering is a useful technique in textual data mining. Cluster analysis divides objects into meaningful groups, called clusters, based on information about the objects and the relationships between them. A large amount of material on almost any topic is available on the internet with a single click, and it is tedious for users to separate the information they actually need from the rest of the data, especially when this has to be done manually. This project explains how to address this problem and assist the user effectively. We used the ROCK algorithm with some modifications. ROCK generates better clusters than other clustering algorithms for data with categorical attributes. We used the cosine measure to compute the similarity between two documents. Furthermore, we used an adjacency list instead of a sparse matrix to store the documents. The algorithm was evaluated on text documents. Because of these enhancements, it is named Enhanced ROCK, or EROCK. These changes affect the time and space complexity of the algorithm. Experimental results on standard test documents show the outcomes of the EROCK algorithm. The similarity threshold, the number of clusters to be obtained, and the text documents (corpus) are the main parameters used in the evaluation of EROCK. EROCK was implemented in Java with jdk1.6.0, using NetBeans IDE 6.5.1 as the development environment. Experiments were carried out on a variety of standard text documents with a specific approach.
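
As a rough illustration of the ingredients named in the abstract (cosine similarity between documents, a similarity threshold, and neighbor information kept in adjacency lists), the following Java sketch may help. It is not the thesis implementation. The class and method names, the whitespace tokenization, the raw term-frequency weighting, and the choice to exclude a document from its own neighbor list are all assumptions made for this example, and the cluster-merging step of ROCK is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and design choices are assumptions, not the EROCK source.
final class ErockSketch {

    // Builds a raw term-frequency vector (assumption: lowercase, split on whitespace).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            Integer old = tf.get(token);
            tf.put(token, old == null ? 1 : old + 1);
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Adjacency list: document i is a neighbor of j when cosine(i, j) >= theta.
    // Assumption: a document is not listed as its own neighbor.
    static List<List<Integer>> neighbors(List<Map<String, Integer>> docs, double theta) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) adj.add(new ArrayList<Integer>());
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                if (cosine(docs.get(i), docs.get(j)) >= theta) {
                    adj.get(i).add(j);
                    adj.get(j).add(i);
                }
            }
        }
        return adj;
    }

    // ROCK-style link count: the number of neighbors shared by documents i and j.
    static int links(List<List<Integer>> adj, int i, int j) {
        Set<Integer> seen = new HashSet<>(adj.get(i));
        int count = 0;
        for (int k : adj.get(j)) {
            if (seen.contains(k)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] corpus = {
            "data mining clusters text documents",
            "text mining clusters documents data",
            "clusters text data mining algorithm",
            "football match score goal"
        };
        List<Map<String, Integer>> docs = new ArrayList<>();
        for (String text : corpus) docs.add(termFrequencies(text));

        List<List<Integer>> adj = neighbors(docs, 0.5); // 0.5 is an arbitrary threshold
        for (int i = 0; i < adj.size(); i++) {
            System.out.println("doc " + i + " neighbors: " + adj.get(i));
        }
        System.out.println("links(0, 1) = " + links(adj, 0, 1));
    }
}
```

In this toy corpus the first three documents share most of their terms, so they become mutual neighbors at a threshold of 0.5, while the fourth document shares no terms with the others and ends up with an empty neighbor list. Storing only the neighbors that pass the threshold is what keeps the structure compact when most document pairs are dissimilar.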