NUST Institutional Repository

Supervised Topic Modelling in Urdu Language for Domain Lexicon Generation

dc.contributor.author Mehboob, Aziya
dc.contributor.author Supervised by Dr. Hammad Afzal.
dc.date.accessioned 2020-11-17T07:09:18Z
dc.date.available 2020-11-17T07:09:18Z
dc.date.issued 2019-05
dc.identifier.other TCS-441
dc.identifier.other MSCS / MSSE-22
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/12418
dc.description.abstract Urdu is the eighth most widely spoken language in the world, with more than 140 million speakers in the subcontinent. It is the national language of Pakistan and an official language of six Indian states. Despite its wide usage, research on the Urdu language is still limited. The volume of Urdu textual data is increasing day by day, and it is important to understand it and extract information about its underlying themes. Latent Dirichlet Allocation (LDA) is a topic modeling technique that has been widely applied to large collections of text to uncover latent themes. It is based on the “bag of words” assumption. For Urdu, only a small amount of research on topic modeling has been carried out, and it is limited to unigrams. In this thesis, an alternative method for Urdu LDA based on “bag of words” plus “bag of multi-word terms” is developed. Multi-word terms are extracted using the C-value method; these terms are then integrated with words. As LDA is an unsupervised method, it does not provide any label for the extracted themes. In the proposed framework, automatic labeling of Urdu topic models is also developed. Candidate labels are extracted using linguistic filters and are then ranked by the similarity between topic vectors and candidate label vectors, computed using word2vec and letter trigrams. Experiments are performed on an Urdu corpus collected from BBC Urdu. Results are evaluated using a word intrusion user study and coherence scores, and the performance of the model is also tested on unseen documents. For automatic labeling, results are validated by domain experts, and they demonstrate that our framework can help Urdu researchers gain a fast and better understanding of their Urdu document collections. en_US
dc.language.iso en en_US
dc.publisher MCS en_US
dc.title Supervised Topic Modelling in Urdu Language for Domain Lexicon Generation en_US
dc.type Thesis en_US
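
The abstract describes ranking candidate labels by the similarity between topic vectors and candidate label vectors. As a rough illustration only, the sketch below ranks labels by cosine similarity over letter-trigram counts. It is not the thesis's implementation: the word2vec component is omitted, and the function names (`letter_trigrams`, `cosine`, `rank_labels`) and the `#` word-boundary padding are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def letter_trigrams(text):
    """Count letter trigrams of each word, padded with '#' boundary markers."""
    counts = Counter()
    for word in text.split():
        padded = f"#{word}#"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_labels(topic_words, candidate_labels):
    """Rank candidate labels by trigram similarity to a topic's top words."""
    topic_vec = letter_trigrams(" ".join(topic_words))
    scored = [
        (label, cosine(topic_vec, letter_trigrams(label)))
        for label in candidate_labels
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_labels` with a topic's top words and a list of candidate labels returns `(label, score)` pairs in descending order of similarity. Because Python strings are Unicode, the same functions accept Urdu text directly.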

