Abstract:
Detection of cyberbullying on social media platforms has recently become a vital challenge for researchers, as the problem is at its peak. Research to tackle this issue is therefore being carried out in numerous languages around the globe. People use different social media platforms, in their native languages, to express their points of view. When they express anger or frustration rather than positive views, they often use abusive or offensive wording in their native language. Although systems that automatically monitor, detect and block offensive content exist for some languages, they are unfortunately limited to resource-rich languages and are very rare for low-resource languages. The main reason is the lack of datasets for native/local languages.
In recent years, unethical behavior in the cyber-environment has come to light. The presence of offensive language on social media platforms, and its automatic detection, is becoming a major challenge in modern society. The complexity of natural language constructs makes this task even more challenging. Until now, most research has focused on resource-rich languages like English. Roman Punjabi and Punjabi are the two scripts used to write the Punjabi language on social media.
To the best of our knowledge, little or no work has been done on this topic. However, the computational linguistics community has given increasing attention to detecting offensive language and hate speech on online social media applications such as YouTube [4-6], Twitter [7-9], Facebook [10] and blogs [11-12], in resource-rich languages.
Our inspiration is “Automatic Detection of Offensive Language for Urdu and Roman Urdu” by Muhammad et al. [17] and “Hate-Speech and Offensive Language Detection in Roman Urdu” by Hammad Rizwan, Muhammad Haroon Shakeel (2020.emnlp-main.197).
Similarly, there is a huge community of Punjabi speakers in Pakistan, India and Bangladesh, and cyberbullying among them via social media is at its peak. This is a critical issue that needs to be addressed. We have already started creating the dataset for the proposed solution, as has been done for other low-resource languages.
This research work proposes a model for Punjabi, a very low-resource language, that automatically detects offensive language/words in text. To create a dataset for Roman Punjabi, we collected two separate sets of comments/feedback from different social media platforms, one of 100,000 comments and one of 1,000 comments, and manually labeled the 1,000-comment set as offensive or non-offensive. The proposed model is based on zero-shot learning (ZSL), a machine learning setting in which a learner observes samples from classes that were not seen during training and predicts the category to which they belong. Zero-shot approaches associate the observed (seen) and non-observed (unseen) categories through auxiliary information that encodes observable distinguishing features of objects.
With ZSL, the best categorization, 76 percent accuracy, is attained at a threshold value of 0.45. This unsupervised algorithm divided the datasets into two groups: offensive and non-offensive. The same threshold value and distance algorithm can be used to categorize unlabeled datasets (UDS), so the approach can classify any amount of unlabeled text, here with 76 percent accuracy. One of the fundamental steps in classic machine learning or deep learning algorithms is training the algorithm, and deep neural networks need a massive amount of training data and a lot of computation. By contrast, unsupervised algorithms offer higher classification accuracy while being computationally less expensive.
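The threshold-based categorization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the text representation, the distance measure, or the auxiliary information. This sketch assumes a toy bag-of-words vector, cosine similarity, and a hypothetical list of seed words describing the offensive class. Only the threshold value 0.45 comes from the text, and the names `embed`, `classify` and the placeholder tokens are illustrative.

```python
import math
from collections import Counter

THRESHOLD = 0.45  # value reported in the abstract; its use here is an assumption


def embed(text):
    """Toy bag-of-words vector (a stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(comment, offensive_description, threshold=THRESHOLD):
    """Label a comment without any training step.

    The comment is compared with auxiliary information describing the
    offensive class; a similarity at or above the threshold yields
    "offensive", anything below it yields "non-offensive".
    """
    score = cosine_similarity(embed(comment), embed(offensive_description))
    label = "offensive" if score >= threshold else "non-offensive"
    return label, score


# Hypothetical placeholder tokens standing in for Roman Punjabi seed words.
seed_words = "badword1 badword2 badword3"

label_a, score_a = classify("you are badword1 badword2", seed_words)
label_b, score_b = classify("nice video thanks", seed_words)
print(label_a, round(score_a, 3))
print(label_b, round(score_b, 3))
```

Because no parameters are fitted, the same function can be applied unchanged to an unlabeled collection of any size. In practice the bag-of-words vector would presumably be replaced by a multilingual sentence embedding, and the threshold would be tuned on the manually labeled comments.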