Abstract:
Online shopping has grown rapidly in recent years, and with it the volume of data on
e-commerce sites, especially in the form of reviews left by customers.
These reviews are written in many languages, depending on the country in which a site
operates. Roman Urdu, that is, Urdu written in the Latin script and used mainly in the
Indian subcontinent, is one such language. While sentiment analysis has been studied
extensively for widely used languages such as English, German, and French, comparatively
little work exists for Roman Urdu. This project aims to address this gap by exploring
natural language processing techniques for accurately classifying the sentiment of Roman
Urdu text. For this purpose, reviews from the popular Pakistani e-commerce website
Daraz.pk are used. The data
was first preprocessed to remove stop words and emojis. A DistilBERT model was then
fine-tuned on more than 10,000 Daraz.pk reviews, achieving a classification accuracy
of 85 percent. In addition, a web interface will be provided in which a user can paste
a link to a product and generate visualizations that help them decide whether to buy
it. The intended users are Daraz.pk customers who want a more reliable basis for
deciding whether a product should be bought.
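
The preprocessing step described above (removing emojis and stop words) could look
roughly like the following sketch. It is illustrative only and not the project's actual
code: the stop-word list, the emoji code-point ranges, and the function name
`preprocess` are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the list actually used in the project is not given in the abstract.
ROMAN_URDU_STOP_WORDS = {"hai", "ka", "ki", "ke", "ko", "se", "aur", "ye", "yeh"}

# Assumed emoji coverage: common pictograph blocks, misc symbols and dingbats,
# flag indicators, plus the variation selector and zero-width joiner.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"
    "\U00002600-\U000027BF"
    "\U0001F1E6-\U0001F1FF"
    "\U0000FE0F"
    "\U0000200D"
    "]+"
)


def preprocess(review: str) -> str:
    """Strip emojis, lowercase, tokenize, and drop stop words."""
    text = EMOJI_PATTERN.sub(" ", review)
    # Keep only simple alphanumeric tokens (Roman Urdu uses Latin letters).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [tok for tok in tokens if tok not in ROMAN_URDU_STOP_WORDS]
    return " ".join(kept)


if __name__ == "__main__":
    print(preprocess("Yeh product bohat acha hai 😍👍"))
```

Negation words such as "nahi" are intentionally left out of the example stop-word list,
since dropping them would change the sentiment of a review.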