🤖 AI Summary
NLP models often replicate societal biases and lack interpretability, making fairness hard to assure. This paper presents the first large-scale quantitative study of how input-based explanation methods, such as attention weights and saliency maps, relate to fairness in hate speech detection. Through experiments across both encoder-only and decoder-only models, the authors find that such explanations effectively identify bias-driven predictions and provide useful supervision signals for debiasing during training. However, they are not stable or reliable enough for cross-model fairness comparison or model selection. The key contribution is empirical evidence of an asymmetric relationship between interpretability and fairness: explanation methods are suitable for bias diagnosis and intervention, but not for ranking models by fairness. This work fills a critical gap in quantitative research on explainability and fairness in NLP.
📝 Abstract
Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
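To make "input-based explanation" concrete, the sketch below uses a hypothetical linear bag-of-words classifier (the vocabulary, weights, and threshold are illustrative assumptions, not the paper's models). For a linear model, the common gradient-times-input attribution reduces to weight times feature, so each token's contribution to the prediction can be read off directly; a disproportionate attribution on an identity term in a benign sentence is the kind of signal the paper uses to flag bias-driven predictions.

```python
import math

# Hypothetical toy "hate speech" classifier: bag-of-words logistic model.
# Vocabulary and weights are invented for illustration; the inflated weight
# on the identity term "muslim" simulates a biased model.
vocab = ["stupid", "people", "muslim", "are", "nice"]
weights = [2.0, 0.1, 1.5, 0.0, -2.0]
bias = -1.0

def predict(tokens):
    """Probability that the input is hateful, under the toy model."""
    x = [1.0 if w in tokens else 0.0 for w in vocab]
    z = sum(wi * xi for wi, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def saliency(tokens):
    """Gradient x input attribution; for a linear model this is w_i * x_i."""
    x = [1.0 if w in tokens else 0.0 for w in vocab]
    return {w: wi * xi for w, wi, xi in zip(vocab, weights, x) if xi > 0}

# A benign sentence mentioning an identity group: the attribution is
# dominated by the identity term rather than any hateful content, which
# flags the prediction as potentially bias-driven.
attributions = saliency(["muslim", "people", "are", "nice"])
most_salient = max(attributions, key=attributions.get)  # "muslim"
```

In the paper's setting the model is a neural network and the attribution comes from attention weights or gradient-based saliency over token embeddings, but the diagnostic logic is the same: inspect which input tokens drive the score.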