Urdu topic modeling: Discovering the abstract topics of Urdu text

Ahmad, Waheed

dc.contributor.author	Ahmad, Waheed
dc.date.accessioned	2022-08-07T13:18:18Z
dc.date.available	2022-08-07T13:18:18Z
dc.date.issued	2022
dc.identifier.uri	http://10.250.8.41:8080/xmlui/handle/123456789/30046
dc.description	CL-T-6618	en_US
dc.description.abstract	Topic modeling is a popular tool for analysing large corpora of textual data and discovering the hidden structure and relations within them. To accomplish this task, the focus is on Natural Language Processing models for topic modeling such as BERTopic, LDA and NMF. LDA and NMF are the two most prevalent topic modeling strategies; however, newer methods based on bidirectional transformers exist. LDA is a probability-based model, whereas NMF uses a matrix factorization approach. In this research, we test and compare the two most common topic models, LDA and NMF, with a newer transformer-based model, BERTopic, on Urdu language datasets. The results of the topic modeling strategies are evaluated using different coherence measures: c_npmi, c_v, c_uci and u_mass. In comparison to NMF and LDA, the overall results provided by BERTopic were superior. Pre-trained language models based on transformers give word and sentence embeddings that depend on the context of the term. The results can then be compared to those of topic modeling strategies that use classic machine learning techniques. This is done to see whether topic model quality has increased, as well as how reliant the strategies are on manually chosen model hyperparameters and data preprocessing. These topic models are useful for summarising and organising a huge text corpus, as well as providing an overview of how topics change over time. The dataset used for the thesis consists of Urdu columns and Urdu news from Urdu websites such as Hamariweb.com and Urdupoint.com; it contains more than 100k articles.	en_US
dc.description.sponsorship	Dr Abdul Wahid	en_US
dc.language.iso	en	en_US
dc.publisher	SEECS-School of Electrical Engineering and Computer Science NUST Islamabad	en_US
dc.title	Urdu topic modeling: Discovering the abstract topics of Urdu text	en_US
dc.type	Thesis	en_US
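
The abstract evaluates topic models with coherence measures such as c_npmi. As an illustration only, the following is a minimal sketch of an NPMI-style coherence score for a single topic. It is not the thesis's implementation: it assumes document-level co-occurrence rather than the sliding-window counts normally used for c_npmi, and the function name `npmi_coherence` and the toy English tokens are hypothetical stand-ins for tokenized Urdu articles.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic.

    Assumption: probabilities are estimated from document-level
    co-occurrence (a simplification of the usual sliding window).
    """
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI
        elif p12 == 1:
            scores.append(1.0)   # both words in every document: limiting value
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus (hypothetical data, not taken from the thesis dataset).
docs = [
    ["cricket", "match", "team"],
    ["cricket", "team", "score"],
    ["election", "vote", "party"],
    ["election", "party", "cricket"],
]

coherent = npmi_coherence(["election", "party"], docs)
incoherent = npmi_coherence(["team", "vote"], docs)
```

Scores range from -1 (the words never appear together) to 1 (the words always appear together), so a topic whose top words co-occur often scores higher. Comparing models such as LDA, NMF and BERTopic would then amount to averaging this score over each model's topics.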