Abstract:
Urdu is the eighth most widely spoken language in the world, with more than 140 million
speakers in the subcontinent. It is the national language of Pakistan and an official language
of six Indian states. Despite its widespread use, research on the Urdu language is
still limited. Urdu textual data is growing day by day, and it is important to
understand it and extract information about its underlying themes. Latent Dirichlet
Allocation (LDA) is a topic modeling technique that has been widely applied to large
text collections to uncover latent themes. It is based on the “bag of words”
assumption. For Urdu, little research on topic modeling has been carried out, and it
is limited to unigrams.
In this thesis, an alternative method for Urdu LDA based on “bag of words” plus
“bag of multi-word terms” is developed. Multi-word terms are extracted using the C-value
method; these terms are then integrated with single words. As LDA is an unsupervised
method, it does not provide labels for the extracted themes. In the proposed framework, an
automatic labeling method for Urdu topic models is also developed. Candidate labels are
extracted using linguistic filters, and these labels are then ranked by the similarity
between topic vectors and candidate label vectors, computed using word2vec and letter
trigrams. Experiments are performed on an Urdu corpus collected from BBC Urdu. Results are evaluated
using a word intrusion user study and coherence scores; the performance of the
model is also tested on unseen documents. For automatic labeling, results are validated by
domain experts and demonstrate that our framework can help Urdu researchers gain a
faster and better understanding of their Urdu document collections.
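
As an illustration of the multi-word term extraction step mentioned above, the following is a minimal sketch of the standard C-value score, not the thesis implementation. It assumes candidate terms have already been selected by a linguistic filter and counted; the function names (`is_nested`, `c_value`) and the toy frequencies are hypothetical, and English placeholder tokens are used instead of Urdu for readability.

```python
import math

def is_nested(short, long):
    """Return True if token tuple `short` occurs contiguously inside the longer tuple `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))

def c_value(term_freqs):
    """Compute C-value scores for candidate terms.

    term_freqs maps a candidate term (tuple of tokens) to its corpus frequency.
    A term that appears inside longer candidates has its frequency reduced by
    the mean frequency of those longer candidates; the result is weighted by
    log2 of the term length in words, so single words score zero.
    """
    scores = {}
    for a, f_a in term_freqs.items():
        # Frequencies of all longer candidates that contain term a.
        containing = [f_b for b, f_b in term_freqs.items() if is_nested(a, b)]
        if containing:
            adjusted = f_a - sum(containing) / len(containing)
        else:
            adjusted = f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores

# Toy frequencies (hypothetical, for illustration only).
freqs = {
    ("topic", "model"): 10,
    ("latent", "topic", "model"): 4,
    ("urdu", "topic", "model"): 2,
}
scores = c_value(freqs)
```

Terms with high scores would then be treated as single tokens alongside ordinary words before LDA is run; how the thesis performs that integration is not specified here.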