Abstract:
Cyberbullying through offensive language on the Internet has become a major problem across all age groups. Automatically detecting offensive language in social media applications, websites, and blogs is a difficult but important task, and the complexity of natural language constructs makes it even more challenging. Until now, most research has focused on resource-rich languages such as English. This study addresses the detection of offensive language in user audio in a resource-poor language, Pashto. We propose the first Pashto offensive-language dataset, consisting of user-generated audio from social media. We use individual and combined n-gram techniques to extract features at the word level and on a gender basis. We will apply classifiers from different machine learning techniques to detect offensive language in Pashto audio.
Offensive Language Detection Using Machine Learning (OLDUM) aims to develop a prototype system that uses machine learning to detect offensive words in the Pashto language, helping social media applications and websites automate the screening of audio and voice notes and thereby helping to stop unethical activity.
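As an illustration of the individual and combined word-level n-gram features mentioned above, the following is a minimal sketch, not the implementation used in this study. It assumes each audio clip has already been converted into a list of word tokens; the function names `word_ngrams` and `ngram_features` and the placeholder tokens are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of order n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, orders=(1,)):
    """Count n-grams for every order in `orders`.

    A single order gives an "individual" feature set (e.g. unigrams only);
    several orders give a "combined" one (e.g. unigrams plus bigrams).
    """
    counts = Counter()
    for n in orders:
        counts.update(word_ngrams(tokens, n))
    return counts


# Placeholder tokens standing in for a transcribed clip (assumption:
# transcription into words happens in an earlier step).
tokens = ["w1", "w2", "w3", "w2"]

unigram_features = ngram_features(tokens, orders=(1,))      # individual
combined_features = ngram_features(tokens, orders=(1, 2))   # combined
```

Count dictionaries of this kind could then be turned into fixed-length vectors and passed to whichever classifiers are being compared.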