dc.description.abstract |
Topic modelling is a popular tool for analysing a large corpus of textual data and
discovering the hidden structure and relations within it. To accomplish this
task, we focus on Natural Language Processing models for
topic modelling, such as BERTopic, LDA and NMF. The two most prevalent topic modelling
strategies are LDA and NMF; however, newer methods based on bidirectional
transformers now exist. LDA is a probabilistic model, whereas NMF uses a matrix
factorization approach. In this research, we test and compare the two most common
topic models, LDA and NMF, with a newer transformer-based model, BERTopic, on Urdu
language datasets. The results of the topic modelling strategies are evaluated using several
coherence measures: c_npmi, c_v, c_uci and u_mass. In comparison to
NMF and LDA, the overall results provided by BERTopic were superior. Pre-trained
transformer-based language models give word and sentence embeddings that depend
on the context of the term. The results can then be compared with those of topic modelling
strategies that use classic machine learning techniques. This is done to see whether topic
model quality has increased, as well as how reliant the strategies are on manually set
model hyperparameters and data preprocessing. These topic models are useful for summarising and
organising a huge text corpus, as well as providing an overview of how topics change
over time.
The dataset used for this thesis consists of Urdu columns and Urdu news from Urdu websites such
as Hamariweb.com and Urdupoint.com; it contains more than 100k articles. |
en_US |