dc.description.abstract |
The ever-increasing influx of data over the internet has become a reality that poses the challenge of sifting through it to extract meaningful information. Over the last two decades, users have been overwhelmed with both textual and multimedia data due to the popularity of social media and news platforms. To cope with the challenges of information overload, various research technologies have gained popularity. Natural Language Processing (NLP) has seen significant improvements in the efficiency and accuracy of textual data processing since the inception of Language Models built on Deep Learning based Artificial Neural Networks. Automatic Summarization (under the umbrella of NLP) is the process of extracting only the meaningful information from a text, reducing its length while preserving its sense. Urdu, despite being the 10th most spoken language in the world, is still a low-resource language with little to no research in the fields of Automatic Summarization and NLP; most research is restricted to high-resource languages such as English. This work explores Deep Learning based Pre-trained Language Models built on self-attentive transformers for both Extractive and Abstractive Summarization, capturing contextual information. Moreover, a summarization dataset of 76k records is created by collecting article-summary pairs from the news domain. To the best of our knowledge, it is the first and largest dataset available for Urdu Summarization. Experimental results demonstrate competitive evaluation scores (ROUGE, BERTScore) for summarization models fine-tuned on the newly created dataset. A human evaluation is also carried out, identifying the shortcomings of automatic evaluation methods in the field of summarization. |
en_US |