Information Retrieval using Machine Learning

Kenan Azam

dc.contributor.author	Kenan Azam
dc.date.accessioned	2020-11-20T09:32:08Z
dc.date.available	2020-11-20T09:32:08Z
dc.date.issued	2004
dc.identifier.uri	http://10.250.8.41:8080/xmlui/handle/123456789/13165
dc.description	Supervisor: Mr. Shahzad Khan	en_US
dc.description.abstract	The need for information retrieval (IR) dates back to the earliest beginnings of mankind, and the quest accelerated as earlier species evolved into Homo sapiens. As time passed, information grew and had to be stored in some form, and it was this storage of information that gave rise to the problem of its retrieval. The earliest records of IR techniques date back to the third century B.C. [Hess, 1955], making IR one of the most mature areas in Computer Science. The quest has not ended: as information continues to be gathered at unprecedented rates and in unprecedented quantities, the problem of its retrieval has become even more pressing. Information retrieval is concerned with selecting, from a collection, the relevant documents desired by an inquirer. With the emergence of Artificial Intelligence techniques and the efforts of the IR community, this quest may come to a satisfying end in the near future (within a couple of decades). This research project therefore started with a study of the existing IR models, with a focus on machine learning. It involved implementing the best available model and making some modest contributions to that model and to other IR areas: phrase extraction, document clustering, word sense disambiguation and relevance feedback. The fundamental information retrieval model used is a probabilistic model based on Bayesian belief networks, which are an integral part of many learning systems. Retrieval is treated as an evidential reasoning process in which various pieces of evidence about documents and queries are combined to match the query to relevant documents. This model was developed by Bruce Croft and Howard Turtle in the Information Retrieval Laboratory at the University of Massachusetts Amherst, and is currently considered one of the best models, as shown by the results of TREC (the Text REtrieval Conference) [TREC, 1997]. 
To test the effectiveness of this model, it was implemented over a document collection (NEWS) built from news articles in the national newspaper “DAWN”. The performance of the system was benchmarked against an existing IR system, “PIRS”, which is based on the popular Vector Space Model and was built at NIIT under the supervision of Asst. Prof. Shahzad Khan. Our model performed significantly better than the vector space model, which agrees with the results of TREC [TREC, 1997]. Once the basic Bayesian belief model was up and running, it was modified to support phrases as indexing terms. These experiments produced some surprising results, which hinted at the possibility of building a noun extractor based on phrases. A separate study addressed document clustering, the grouping of similar documents. One benefit of clustering is faster queries, because the query can be run only against the relevant document partition. Precision and recall also both increase if the correct cluster is identified. A novel approach to document clustering was devised, based on associations between the indexing terms of documents. The clusters formed by the proposed method were studied; that they would improve query performance was assumed on the basis of the groundbreaking results of Fagan [Fagan, 1987]. During the course of the research project, the problem of word sense disambiguation was also examined. The result was a word sense disambiguator built on a novel approach and placed in the basic model to shortlist the set of relevant documents returned by the query evaluator. It does this by inferring from the query the sense the inquirer intends. 
Lastly, a separate study incorporated relevance feedback into the basic model by means of query expansion, using two different approaches, one of which was developed by us.
References
[TREC, 1997] TREC. Evaluation Techniques and Measures. Part of Appendix A in TREC-6 Proceedings, 1997. [Online] http://trec.nist.gov
[Hess, 1955] Hessel, Alfred. A History of Libraries. The Scarecrow Press, New Brunswick, NJ, 1955.
[Fagan, 1987] Fagan, J. Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Thesis, Technical Report 87-868, Cornell University, Computer Science Department, 1987.	en_US
dc.publisher	SEECS, National University of Sciences and Technology, Islamabad	en_US
dc.subject	Information Technology	en_US
dc.title	Information Retrieval using Machine Learning	en_US
dc.type	Thesis	en_US