Abstract:
The daily volume of news reporting on real-world events is growing exponentially.
At the same time, people need the most important information about an
event or topic in an organized, compact form to make decisions. Document
summarization addresses the problem of presenting the information
in a compact form to readers. Different approaches to summarizing documents
have been proposed and evaluated in the literature. Common research
problems in summarization are redundancy and the extraction of sentences
that are important and semantically linked with other sentences.
We propose a novel multi-document summarization approach that combines
agglomerative hierarchical clustering with Latent Semantic Analysis (LSA).
LSA measures the semantic similarity among different terms and reduces
dimensionality by preserving only the highly weighted vectors.
To identify important terms in our summary, we
have used the Latent Dirichlet Allocation (LDA) model. LDA is a generative
statistical model that allows a set of observations to be explained by a
small number of topics, where the presence of each word is attributable to
the topics of its document.
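
As a rough illustration of the clustering step described above, the following minimal sketch groups sentences by average-link agglomerative clustering over cosine similarity. It is not the system evaluated here: the LSA projection is omitted, so raw term-count vectors stand in for the reduced LSA vectors, and the function names, whitespace tokenization, average-link criterion, and the 0.3 stopping threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative_clusters(sentences, threshold=0.3):
    """Average-link agglomerative clustering; returns lists of sentence indices.

    Assumption: raw term counts replace the LSA-reduced vectors, and merging
    stops once the best average similarity falls below `threshold`.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]  # start with singletons

    def linkage(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = linkage(clusters[x], clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Under these assumptions, one representative sentence per cluster could then be selected for the summary, which is one way to limit redundancy across documents.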
We have used the Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
metric to evaluate our system against other state-of-the-art techniques
on the Document Understanding Conference (DUC) 2004 dataset. Experimental
results show a substantial performance improvement with our system,
which produces more coherent summaries than the other
state-of-the-art techniques. Our summarization approach improves upon current
state-of-the-art summarization systems on mainstream evaluation datasets.
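
To make the evaluation metric concrete, the sketch below computes a simplified ROUGE-N recall: the fraction of reference n-grams that also appear in the candidate summary, with clipped counts. It is an illustration only and not the official ROUGE toolkit; the function name, lowercase whitespace tokenization, the single-reference setting, and the absence of stemming or stop-word removal are assumptions.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall against a single reference summary."""

    def ngrams(text):
        # Assumption: lowercase whitespace tokenization, no stemming.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's match count by how often it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For example, with the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams are matched, giving a ROUGE-1 recall of 5/6.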