Abstract:
The rapid expansion of social media has intensified the spread of hate speech and online
harassment, posing serious threats to vulnerable groups based on age, gender, religion,
and ethnicity. While Artificial Intelligence (AI) offers powerful tools for detecting and
mitigating toxic content, existing AI models often suffer from two critical limitations:
biased predictions that disproportionately impact specific communities and a lack of
interpretability that hinders trust in the results. Most hate speech detection models
overlook the need for transparent explanations behind their classifications, leaving users
and affected communities uncertain about how decisions are made. Addressing these
gaps is essential for developing fair and trustworthy AI solutions that protect targeted
groups from online abuse, which can escalate into real-world violence. This research
aims to tackle the problem of hate speech by developing a method that integrates
Explainable Artificial Intelligence (XAI) to provide clear and understandable explanations and to reduce bias against specific groups based on age, gender, ethnicity, and religion. To develop a hate speech detection system that incorporates XAI,
we began by applying five machine learning models, including Multinomial Naïve Bayes (MNB), Logistic Regression (LR), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT), to the HateXplain benchmark dataset for text classification. The results revealed that BERT outperformed the other models, achieving an accuracy of 98.5%. To interpret the model's predictions, we used two explainability methods, LIME and SHAP, which provided insights into the features influencing the classification decisions (sketched below). To detect hateful content targeted at specific
groups, we developed a multiclass word list based on attributes such as age, religion, gender, and ethnicity (also sketched below). After comparing the model's output with this word list, we used the keywords to redefine and update the data and then retrained the BERT model. Finally, we provided explanations for hate speech targeted at a specific group. The explainability methods were then evaluated on comprehensiveness, sufficiency, and Intersection over Union (IoU) to determine their effectiveness; these metrics measure how well the model-generated explanations align with human-annotated rationales (formalized below). The results showed that while LIME and SHAP performed comparably in providing explanations, SHAP proved to be more computationally expensive and time-consuming. Nevertheless, this work opens up promising opportunities for further research into enhancing explainability methods. Future work could explore additional XAI approaches and apply them to more diverse datasets.
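
As a minimal, hedged sketch of the explanation step, the snippet below applies LIME to a fine-tuned BERT sequence classifier. The checkpoint path, class names, and example post are placeholders (assumptions), not details taken from this work; a similar pattern applies to SHAP through its Explainer interface.

```python
# Minimal sketch (assumptions: a fine-tuned BERT checkpoint at "path/to/finetuned-bert"
# and a three-way label set; neither is taken from the original work).
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-bert")
model = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-bert")
model.eval()

def predict_proba(texts):
    """Return class probabilities for a batch of raw strings, as LIME expects."""
    enc = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["hate speech", "normal", "offensive"])
explanation = explainer.explain_instance(
    "example post to explain", predict_proba, num_features=10, top_labels=1
)
label = explanation.available_labels()[0]
print(explanation.as_list(label=label))  # tokens with their contribution weights
```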
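The comparison against the multiclass word list can be illustrated with a toy keyword matcher. The attribute classes mirror those named above, but the keywords are invented placeholders rather than the list used in this work.

```python
# Toy sketch: flag which protected attribute(s) a post mentions by matching its
# tokens against a multiclass word list (placeholder keywords, not the real list).
TARGET_KEYWORDS = {
    "age": {"boomer", "teen", "elderly", "kid"},
    "religion": {"muslim", "christian", "jewish", "hindu"},
    "gender": {"woman", "women", "man", "men", "girl", "boy"},
    "ethnicity": {"black", "white", "asian", "arab"},
}

def detect_target_groups(tokens):
    """Return the attribute classes whose keywords appear among the post's tokens."""
    lowered = {t.lower().strip(".,!?") for t in tokens}
    return {group for group, words in TARGET_KEYWORDS.items() if lowered & words}

print(detect_target_groups("those women are ...".split()))  # {'gender'}
```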
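For reference, comprehensiveness, sufficiency, and IoU are commonly formalized as follows (an ERASER-style formulation; the notation is ours and is given only as an assumption about the definitions used).

```latex
% m(x)_j: model probability of the predicted class j for input x; r: rationale tokens;
% x \setminus r: the input with the rationale tokens removed.
\[
\mathrm{comprehensiveness} = m(x)_j - m(x \setminus r)_j, \qquad
\mathrm{sufficiency} = m(x)_j - m(r)_j,
\]
\[
\mathrm{IoU} = \frac{\lvert R_{\mathrm{pred}} \cap R_{\mathrm{gold}} \rvert}{\lvert R_{\mathrm{pred}} \cup R_{\mathrm{gold}} \rvert},
\]
where $R_{\mathrm{pred}}$ and $R_{\mathrm{gold}}$ are the sets of tokens in the
model-generated and human-annotated rationales, respectively.
```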