Document Retrieval Using Unsupervised Text Classification With Word Embeddings

Siraj, Zahra

URI: http://10.250.8.41:8080/xmlui/handle/123456789/30048

Date: 2022

Abstract:

Text classification is the process of assigning an appropriate label to a text phrase or text document. Supervised learning is the most common method used for classifying texts. Traditional text classification methods often need a substantial quantity of labelled training data. However, it is not always possible to obtain a labelled text dataset for training classification algorithms, because data labelling often requires a significant amount of time and cost. Insufficient or unlabelled data can therefore be a problem in classification tasks. Unsupervised methods offer the potential for low-cost text categorization of unlabelled data. This dissertation focuses on unsupervised text classification using word embeddings. A previous study generated results using generic word embeddings such as word2vec, GloVe, and Doc2vec. We use the Lbl2vec approach to perform unsupervised text classification, in which a document is classified and assigned a category by looking at the distance between each label vector and the centroid of the document vector. This model is used to classify unlabelled texts into different categories. After classification, the performance of the classifiers is measured with evaluation metrics such as precision, recall, and F1 score. Our experiments on benchmark text datasets show that the proposed method raises the F1 score to 0.81. A comparative analysis shows the classification results relative to different supervised and unsupervised classification algorithms. The benchmark text datasets 20 Newsgroups and AG News have been used for comparison and evaluation purposes.
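
The distance-based assignment described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation and does not use the Lbl2vec library: the label names, the three-dimensional toy vectors, and the helper functions `cosine_similarity` and `classify` are all assumptions made for illustration, and cosine similarity is assumed as the distance measure. In practice the label and document vectors would come from a jointly trained embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(doc_vector, label_vectors):
    """Return the label whose vector is most similar to the document vector."""
    return max(
        label_vectors,
        key=lambda name: cosine_similarity(doc_vector, label_vectors[name]),
    )

# Hypothetical label vectors (toy values, not learned embeddings).
label_vectors = {
    "sports": [1.0, 0.1, 0.0],
    "technology": [0.0, 0.2, 1.0],
}

# Hypothetical document vectors in the same embedding space.
doc_vectors = [
    [0.9, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.0, 0.2],
]

# No labelled training data is used: each document takes its nearest label.
predictions = [classify(doc, label_vectors) for doc in doc_vectors]
print(predictions)
```

Because no labelled examples are needed at prediction time, the only supervision in such a setup is the choice of labels and whatever keywords define their vectors.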