Abstract:
Clustering is a useful technique in textual data mining. Cluster
analysis divides objects into meaningful groups, called clusters, based on information about
the objects and the relationships between them. A large amount of material on almost any
topic is available on the Internet with a single click, so it becomes tedious for the user to
separate the information actually required from the rest of the data. The task is hard because
it has to be done manually. This project explains how to cope with this problem so as to
assist the user effectively. We
used the ROCK (RObust Clustering using linKs) algorithm with some modifications. ROCK
generates better clusters than other clustering algorithms for data with categorical
attributes. We used the cosine measure to compute the similarity between two documents.
Furthermore, we used an adjacency list instead of a sparse matrix to store the documents.
The algorithm has been evaluated on text documents. Because of these enhancements it is
named Enhanced ROCK, or EROCK. These changes affect the time and space complexity of the
algorithm.
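The two modifications described above can be illustrated with a minimal Java sketch. It is not the EROCK implementation: the class name ErockSketch, the method names, the whitespace tokenizer, the raw term-frequency weighting, and the choice to leave a document out of its own neighbor list are all assumptions made for illustration. Documents whose cosine similarity reaches the threshold are recorded as neighbors in an adjacency list, and the link count of two documents is the number of neighbors they share, as in ROCK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not taken from EROCK. */
final class ErockSketch {

    /** Builds a raw term-frequency vector from lower-cased, whitespace-separated tokens. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty()) continue; // empty input yields one empty token
            Integer old = tf.get(token);
            tf.put(token, old == null ? 1 : old + 1);
        }
        return tf;
    }

    /** Cosine similarity dot(a, b) / (|a| * |b|); returns 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) sum += (double) x * x;
        return Math.sqrt(sum);
    }

    /** Adjacency list: entry i holds the indices of documents with similarity to i >= theta. */
    static List<List<Integer>> neighborLists(List<Map<String, Integer>> docs, double theta) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) adj.add(new ArrayList<Integer>());
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                if (cosine(docs.get(i), docs.get(j)) >= theta) {
                    adj.get(i).add(j); // store each neighbor pair in both lists
                    adj.get(j).add(i);
                }
            }
        }
        return adj;
    }

    /** Number of neighbors shared by documents i and j (the link count in ROCK). */
    static int links(List<List<Integer>> adj, int i, int j) {
        int count = 0;
        for (int n : adj.get(i)) {
            if (adj.get(j).contains(n)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = Arrays.asList(
            termFrequencies("text clustering groups similar text documents"),
            termFrequencies("clustering text documents by similarity"),
            termFrequencies("similar documents share text clustering terms"),
            termFrequencies("football match results"));
        // 0.3 is an arbitrary threshold chosen for this illustration.
        List<List<Integer>> adj = neighborLists(docs, 0.3);
        System.out.println("neighbors: " + adj);
        System.out.println("links(0, 1): " + links(adj, 0, 1));
    }
}
```

In this sketch the adjacency list stores only the document pairs that pass the threshold, whereas a full similarity matrix would reserve a cell for every pair; this is the kind of space saving the change of data structure is aimed at.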
Experimental results on standard test documents show the outcomes of the
EROCK algorithm. The main parameters used for the evaluation of EROCK are the similarity
threshold, the number of clusters to be obtained, and the text documents (corpus).
EROCK has been implemented in Java with JDK 1.6.0, using NetBeans IDE 6.5.1 as the
development environment. Experiments have been carried out on a variety of standard text
documents using a specific approach.