Abstract:
The Label Induction Grouping Algorithm (LINGO) is a clustering algorithm that groups
text documents by the similarity of their contents. LINGO is mainly applied to clustering
the documents returned by web search engines, and in this role it behaves much like a search
engine itself. The algorithm works in two main phases. The first phase induces labels for the
clusters to be formed from the text documents, and the second phase assigns content
(documents) to those cluster labels. Label induction, the first phase, relies on a well-known
information retrieval technique, latent semantic indexing (LSI). Content discovery, the second
phase, determines the contents of the clusters using another information retrieval method, the vector
space model. This study modifies the existing algorithm by changing how content is assigned
to the induced labels. Latent semantic indexing is used for content assignment as well as for
label induction, which improves recall. Cluster quality also improves significantly, and
overlap between clusters is reduced by introducing a merge operation before the final
clusters are formed.
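
The content assignment and merge steps described above can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the implementation used in this study. The function names, the similarity threshold, the Jaccard-based merge criterion and the toy vectors are all assumptions made for illustration. The vectors may be read either as term-space vectors (vector space model) or as vectors already projected into a reduced LSI space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_documents(label_vectors, doc_vectors, threshold=0.5):
    """Map each label to the set of document ids whose similarity to the
    label reaches the (hypothetical) threshold. A document may match
    several labels, so the resulting clusters can overlap."""
    return {
        label: {doc_id for doc_id, dvec in doc_vectors.items()
                if cosine(lvec, dvec) >= threshold}
        for label, lvec in label_vectors.items()
    }


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_overlapping(clusters, max_overlap=0.5):
    """Greedy single-pass merge: a cluster joins the first already-kept
    cluster it overlaps by more than max_overlap, otherwise it is kept
    as a new cluster. The merge criterion is an assumption of this sketch."""
    merged = []  # list of (labels, document ids) pairs
    for label, docs in clusters.items():
        for labels, members in merged:
            if jaccard(members, docs) > max_overlap:
                labels.append(label)
                members |= docs  # in-place union keeps the pair in sync
                break
        else:
            merged.append(([label], set(docs)))
    return merged


# Toy two-dimensional vectors, invented for illustration only.
label_vectors = {
    "neural networks": [1.0, 0.0],
    "deep learning": [0.9, 0.1],
    "stock market": [0.0, 1.0],
}
doc_vectors = {"d1": [1.0, 0.1], "d2": [0.8, 0.2], "d3": [0.1, 1.0]}

clusters = assign_documents(label_vectors, doc_vectors)
final_clusters = merge_overlapping(clusters)
```

In this toy input the first two labels attract the same documents, so the merge step folds them into one cluster and leaves the third label on its own.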
To evaluate the proposed algorithm, the 20 Newsgroups dataset is used, and all
experiments are performed in MATLAB R2016b.