Abstract:
The continued use of hate speech in different languages on social media has drawn significant attention in the past few years. Because online hate speech inflicts harm on society, detecting it is necessary regardless of how widely a language is used. Our aim is to develop the first resource for offensive language and hate speech detection in the Urdu language by collecting data from Twitter. While most work in this domain has been done in English and is centered around the major target categories of hate speech, such as religion, racism, and sexism, we further divide these major categories into subcategories based on the type or degree of hate conveyed. The characteristics taken into account in our study of hate speech include religion, ethnicity, and national origin. Each of these categories is further divided into subcategories, namely symbolization, insult, and attribution. Samples in the corpus are manually annotated with the corresponding hate speech categories and subcategories, and each sample is also labeled as offensive or not. We then perform experiments with existing classifiers and examine the impact of different features on hate speech detection.