Abstract:
This research fine-tunes a BERT model and augments it with long short-term memory (LSTM) layers. BERT, already well established for text classification, is combined with additional layers to further improve classification accuracy. Many task-specific BERT models are available, but each is trained for only a single task. This paper demonstrates the classification of four different classes: chats, emails, news, and tweets. The method is straightforward: first, a dataset for each target class is collected and preprocessed with NLP libraries to remove extraneous and redundant data. A data loader is then prepared to feed the training and validation data into a BERT-base model. Before that, the BERT tokenizer is applied, since BERT only accepts input presented in a specific format with the special tokens [CLS] and [SEP].
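As a minimal sketch of this tokenization step, assuming the Hugging Face transformers implementation of the BERT-base tokenizer (the specific library and checkpoint name are assumptions, not stated in the paper):

```python
from transformers import BertTokenizer

# Load a pretrained BERT-base tokenizer (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode_plus adds the [CLS] and [SEP] special tokens, pads or truncates
# to a fixed length, and returns an attention mask for the padded positions.
encoded = tokenizer.encode_plus(
    "Example chat message to classify",
    add_special_tokens=True,    # prepend [CLS], append [SEP]
    max_length=128,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt",        # PyTorch tensors, ready for a data loader
)

# The first token decodes to [CLS], confirming the required input format.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0][:6].tolist()))
```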
Following a recommended approach, BERT is fine-tuned separately for each target class. The innovation is the introduction of LSTM layers combined with fully connected (FC) layers and, where needed, pooling layers. The output of the fine-tuned BERT, i.e., the BERT embeddings, is used as input to the LSTM model. Although BERT alone performs well, these additional layers provide an edge on more complex datasets. The LSTM layers are applied bidirectionally to capture additional features in the text and classify it more effectively.
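A minimal PyTorch sketch of this BERT-plus-BiLSTM design may make it concrete; the class name, hidden size, and max-pooling choice are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertBiLSTMClassifier(nn.Module):
    """Hypothetical sketch: BERT token embeddings -> bidirectional LSTM
    -> pooling -> fully connected classification head."""

    def __init__(self, num_classes, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,  # 768 for BERT-base
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,   # forward and backward passes over tokens
        )
        self.fc = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # Per-token BERT embeddings: (batch, seq_len, 768)
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Bidirectional LSTM over the token sequence
        lstm_out, _ = self.lstm(outputs.last_hidden_state)
        # Max-pool over the sequence dimension, then classify
        pooled, _ = lstm_out.max(dim=1)  # (batch, 2 * lstm_hidden)
        return self.fc(pooled)
```

Because the LSTM runs in both directions, its output concatenates the forward and backward hidden states, which is why the FC layer takes 2 * lstm_hidden inputs.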
Overall, accuracy is improved. For binary classification on limited datasets, introducing LSTM layers produces only a minor change in accuracy, but for multi-class classification on complex data the improvement is noticeable. The achieved accuracies are 99% for chats, 98% for emails, 97% for news, and 85% for tweets (complex data with multi-label sentiment analysis). These accuracies surpass those of the standalone BERT model.