NUST Institutional Repository

Urdu topic modeling: Discovering the abstract topics of Urdu text

dc.contributor.author Ahmad, Waheed
dc.date.accessioned 2022-08-07T13:18:18Z
dc.date.available 2022-08-07T13:18:18Z
dc.date.issued 2022
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/30046
dc.description CL-T-6618 en_US
dc.description.abstract Topic modeling is a popular tool for analysing large corpora of textual data and discovering the hidden structure and relations within them. To accomplish this task, this thesis uses Natural Language Processing models for topic modeling: BERTopic, LDA and NMF. LDA and NMF are the two most prevalent topic modeling strategies, but newer methods based on bidirectional transformers now exist. LDA is a probability-based model, whereas NMF uses a matrix factorization approach. In this research, we test and compare the two most common topic models, LDA and NMF, with a newer transformer-based model, BERTopic, on Urdu language datasets. The results of the topic modeling strategies are evaluated using different coherence measures: c_npmi, c_v, c_uci and u_mass. Overall, the results provided by BERTopic were superior to those of NMF and LDA. Pre-trained transformer-based language models give word and sentence embeddings that depend on the context of the term. The results can then be compared to those of topic modeling strategies that use classic machine learning techniques. This is done to see whether topic model quality has increased, as well as how reliant the strategies are on manually set model hyperparameters and data preprocessing. These topic models are useful for summarising and organising a huge text corpus, as well as providing an overview of how topics change over time. The dataset used for the thesis consists of Urdu columns and Urdu news from Urdu websites such as Hamariweb.com and Urdupoint.com; it contains more than 100k articles. en_US
dc.description.sponsorship Dr Abdul Wahid en_US
dc.language.iso en en_US
dc.publisher SEECS-School of Electrical Engineering and Computer Science NUST Islamabad en_US
dc.title Urdu topic modeling: Discovering the abstract topics of Urdu text en_US
dc.type Thesis en_US
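
The abstract evaluates topic models with coherence measures, including an NPMI-based one. As a rough illustration of what such a score captures, the sketch below computes an average normalised pointwise mutual information over pairs of a topic's top words. It is a simplified sketch, not the thesis's evaluation code: the function name `npmi_coherence`, the toy corpus, and the use of document-level co-occurrence are all assumptions made for illustration (library implementations typically estimate probabilities differently, for example from sliding windows).

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of top words (hypothetical helper).

    Probabilities are estimated from document-level co-occurrence,
    which is an assumption of this sketch.
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words to form a pair")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # words co-occur in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenised corpus (placeholder tokens, not data from the thesis).
toy_docs = [
    ["cricket", "match", "team"],
    ["cricket", "team", "score"],
    ["election", "vote", "party"],
    ["election", "party", "minister"],
]

coherent_score = npmi_coherence(["cricket", "team"], toy_docs)
mixed_score = npmi_coherence(["cricket", "election"], toy_docs)
print(coherent_score, mixed_score)
```

In this toy setup, a topic whose top words always appear together scores higher than one that mixes words from unrelated documents, which is the intuition behind using coherence to compare LDA, NMF and BERTopic outputs.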


This item appears in the following Collection(s)

  • MS [375]
