Supervised Topic Modelling in Urdu Language for Domain Lexicon Generation

Mehboob, Aziya; Supervised by Dr. Hammad Afzal.

dc.contributor.author	Mehboob, Aziya
dc.contributor.author	Supervised by Dr. Hammad Afzal.
dc.date.accessioned	2020-11-17T07:09:18Z
dc.date.available	2020-11-17T07:09:18Z
dc.date.issued	2019-05
dc.identifier.other	TCS-441
dc.identifier.other	MSCS / MSSE-22
dc.identifier.uri	http://10.250.8.41:8080/xmlui/handle/123456789/12418
dc.description.abstract	Urdu is the 8th largest spoken language in the world, with more than 140 million speakers in the subcontinent. It is the national language of Pakistan and an official language of six Indian states. Despite its wide usage, research on the Urdu language is still limited. Urdu textual data is growing day by day, and it is important to understand it and extract information about its underlying themes. LDA is a topic modeling technique that has been widely applied to large sets of textual data to uncover latent themes. It is based on the “bag of words” assumption. For Urdu, only a little research on topic modeling has been carried out, and it is limited to unigrams. In this thesis, an alternative method for Urdu LDA based on “bag of words” plus “bag of multi-word terms” is developed. Multi-word terms are extracted using the C-value method; these terms are then integrated with single words. As LDA is an unsupervised method, it does not provide any label for the extracted themes. In the proposed framework, automatic labeling of Urdu topic models is also developed. Candidate labels are extracted using linguistic filters and are then ranked by the similarity between topic vectors and candidate label vectors, computed using word2vec and letter trigrams. Experiments are performed on an Urdu corpus collected from BBC Urdu. Results are evaluated using the word intrusion user study method and coherence score, and the performance of the model is also tested on unseen documents. For automatic labeling, results are validated by domain experts; they demonstrate that our framework can help Urdu researchers gain a fast and better understanding of their Urdu document collections.	en_US
dc.language.iso	en	en_US
dc.publisher	MCS	en_US
dc.title	Supervised Topic Modelling in Urdu Language for Domain Lexicon Generation	en_US
dc.type	Thesis	en_US
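
The abstract mentions extracting multi-word terms with the C-value method. As an illustration only, the commonly cited form of the C-value score can be sketched as below. This is a minimal sketch, not the thesis implementation: the function names `contains` and `c_value`, the tuple-of-tokens representation, and the placeholder tokens are all assumptions, and the linguistic filtering that produces the candidate terms is not shown.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous token run in `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_value(freq):
    """Score candidate terms with the commonly cited C-value formula.

    `freq` maps each candidate term (a tuple of tokens) to its corpus frequency.
    For a term a with frequency f(a) and length |a| in words:
      not nested:  log2(|a|) * f(a)
      nested:      log2(|a|) * (f(a) - mean of f(b) over longer candidates b containing a)
    Single-word terms score 0 because log2(1) = 0.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of longer candidates that contain this term.
        nesting = [f_b for b, f_b in freq.items()
                   if len(b) > len(term) and contains(b, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores


# Hypothetical candidate counts; the tokens are placeholders, not real corpus data.
example = {("w1", "w2"): 5, ("w1", "w2", "w3"): 2}
scores = c_value(example)
```

Terms scoring above a chosen threshold could then be joined into single tokens and added to the word list before LDA is run, which is one way to read the abstract's "bag of multi-word terms"; the threshold and joining scheme used in the thesis are not stated in this record.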