dc.description.abstract |
In today's digital world, the internet has become an integral part of our lives.
It is difficult to imagine life without the internet or, for that matter, without
email. Memories of life before the internet seem like lore, especially to the
generation that is growing up taking these things for granted.
The internet and email have put huge amounts of information at everyone's
fingertips. The advent of powerful search engines such as Google® has revolutionized
searching. Despite all this development, whenever we need something we are typically
presented with hundreds, if not thousands, of potential answers. Things seem to have
gone out of control.
In such a scenario, other sources of information such as online journals, emails, online
newspapers and the online equivalents of just about everything on paper do not help
the cause. The average employee begins to feel the weight of too much information at
their disposal. Instead of aiding decision making, information becomes a
hindrance.
Information has become so abundant that we have difficulty extracting the right
information, in the right amount, for decision making. This problem has been dubbed
information overload, or too much information.
Information overload manifests itself in the form of lost productivity, health problems
ranging from mild headaches to depression, and sub-optimal decision making.
This thesis aims to address three aspects of this problem by creating a system called
PIASA (Personal Information And Summarization Assistant), designed to assist in three
key areas: intelligently filtering out spam, intelligently marking online articles that
are of interest, and automatically summarizing news articles.
The intelligent spam filter derives its intelligence from mathematical weights
assigned to the individual words appearing in each mail, which are combined using
Bayes' rule. This algorithm has achieved an average accuracy of 93%.
The intelligence of the article classifier is based on a similar algorithm. Here, the
mathematical weights of the individual words appearing in the title and description of a
news item are combined using Bayes' rule into one result. In all, 105 articles
were fed in as interesting and 12 as not interesting. After classification of 598 articles, 42
were misclassified, yielding an accuracy of 92.45%. However, one must remember that the
errors and the subsequent learning are all part of the experiment. Hence, with more training the
results should improve.
The automatic summary generation tool takes online articles, extracts their text,
analyses it, and rejects sentences that are dependent on other sentences in the
text. It picks sentences that are independent in nature and convey complete
information. The sentences deemed fit to be part of a summary are then presented
together. |
en_US |