🤖 AI Summary
NLP models often replicate societal biases and lack interpretability, making fairness hard to assure. This paper presents the first large-scale quantitative study of how input-based explanation methods, such as attention weights and saliency maps, relate to fairness in hate speech detection. Through experiments across both encoder-only and decoder-only models, the authors find that such explanations effectively identify bias-driven predictions and provide useful supervision signals for debiasing during training. However, they are not stable or reliable enough for cross-model fairness comparison or model selection. The key contribution is empirical evidence of an asymmetric relationship between interpretability and fairness: explanation methods are suitable for bias diagnosis and intervention, but not for ranking models by fairness. This work fills a critical gap in quantitative research on explainability and fairness in NLP.
📝 Abstract
Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
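To make "input-based explanation" concrete, the sketch below uses a hypothetical linear bag-of-words classifier (the vocabulary, weights, and threshold are illustrative assumptions, not the paper's models). For a linear model, the common gradient-times-input attribution reduces to weight times feature, so each token's contribution to the prediction can be read off directly; a disproportionate attribution on an identity term in a benign sentence is the kind of signal the paper uses to flag bias-driven predictions.

```python
import math

# Hypothetical toy "hate speech" classifier: bag-of-words logistic model.
# Vocabulary and weights are invented for illustration; the inflated weight
# on the identity term "muslim" simulates a biased model.
vocab = ["stupid", "people", "muslim", "are", "nice"]
weights = [2.0, 0.1, 1.5, 0.0, -2.0]
bias = -1.0

def predict(tokens):
    """Probability that the input is hateful, under the toy model."""
    x = [1.0 if w in tokens else 0.0 for w in vocab]
    z = sum(wi * xi for wi, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def saliency(tokens):
    """Gradient x input attribution; for a linear model this is w_i * x_i."""
    x = [1.0 if w in tokens else 0.0 for w in vocab]
    return {w: wi * xi for w, wi, xi in zip(vocab, weights, x) if xi > 0}

# A benign sentence mentioning an identity group: the attribution is
# dominated by the identity term rather than any hateful content, which
# flags the prediction as potentially bias-driven.
attributions = saliency(["muslim", "people", "are", "nice"])
most_salient = max(attributions, key=attributions.get)  # "muslim"
```

In the paper's setting the model is a neural network and the attribution comes from attention weights or gradient-based saliency over token embeddings, but the diagnostic logic is the same: inspect which input tokens drive the score.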