Abstract:
Ranked Information Retrieval using Weighted TF IDF
Document retrieval is the task of retrieving a relevant document in response to a query,
a question, or a reference document. Tasks such as question answering, summarization,
novelty detection, and information provenance use a document retrieval module as a
preprocessing step, so the performance of these systems depends on the quality of that
module. Other tasks, such as information extraction and machine
translation, operate on documents, using them as training data, as the unit of input or
output, or both, and may benefit from document retrieval to build a training corpus or as a
post-processing step.
In this thesis we begin by studying the IR model and then build a thorough understanding of existing
IR algorithms such as TF IDF, Okapi BM25, and pivoted length normalization. While
studying these algorithms we identified several deficiencies in them
and set out to address those deficiencies. We propose a better approach
for scoring documents, named Weighted TF IDF (WTF IDF). Where TF IDF merely counts terms,
WTF IDF weights them with respect to locality within the document and term order. Moreover,
we aim to cope with different writing styles by searching with a synonym query alongside the
original query, which increases the chances of retrieving novel information from the corpus.
We provide implementations of the existing algorithms, compare their performance
with that of the proposed WTF IDF approach, and present the results. The proposed approach achieves better
results than the existing ones.
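
For orientation, the plain TF IDF baseline referred to above can be sketched as follows. This is a minimal illustration only, assuming whitespace tokenization, raw term counts as tf, and idf = log(N / df). The function names (tokenize, build_index, tfidf_score, rank) are hypothetical and are not taken from the thesis implementation. The sketch does not implement WTF IDF; it only shows the count-based scoring that WTF IDF is meant to improve on.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; no stemming or stop-word removal.
    return text.lower().split()


def build_index(docs):
    # Per-document term counts, plus the number of documents containing each term.
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def tfidf_score(query, counts, doc_freq, n_docs):
    # Sum of tf * idf over the query terms, with idf = log(N / df).
    score = 0.0
    for term in tokenize(query):
        tf = counts.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score


def rank(query, docs):
    # Document indices ordered by descending score; ties keep corpus order.
    term_counts, doc_freq = build_index(docs)
    scores = [tfidf_score(query, c, doc_freq, len(docs)) for c in term_counts]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Note that every occurrence of a term contributes equally here, regardless of where it appears in the document or in what order the query terms occur. That is the property the abstract describes as terms being "counted rather than weighted".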