📝 Abstract
The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines the standard classification loss with an alignment loss that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluate SRA on hate speech benchmarks with rationale annotations in English (HateXplain) and Portuguese (HateBRXplain). Empirically, SRA achieves 2.4× better explainability than current baselines and produces token-level explanations that are more faithful and better aligned with human reasoning. In terms of fairness, SRA is competitive across all measures, ranking second-best at detecting toxic posts targeting identity groups while remaining comparable on the other metrics. These findings demonstrate that incorporating human rationales into attention mechanisms can enhance interpretability and faithfulness without compromising fairness.
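The joint objective described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes an MSE discrepancy between the attention distribution and the normalized rationale mask, and a weighting hyperparameter `lam`; the paper's exact alignment term may differ.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sra_loss(logits, label, attention, rationale_mask, lam=1.0):
    """Illustrative SRA-style joint objective (hypothetical sketch):
    cross-entropy classification loss plus an alignment term that pulls
    the model's attention toward the human rationale distribution."""
    # Classification loss: cross-entropy over the class logits
    probs = softmax(logits)
    cls_loss = -np.log(probs[label])
    # Alignment loss: MSE between attention weights and the rationale
    # mask normalized into a distribution over tokens (assumed form)
    target = rationale_mask / rationale_mask.sum()
    align_loss = np.mean((attention - target) ** 2)
    return cls_loss + lam * align_loss

# Toy usage: 2 classes, 4 tokens, human rationale marks tokens 1-2
logits = np.array([2.0, 0.5])
attention = np.array([0.1, 0.6, 0.2, 0.1])
mask = np.array([0.0, 1.0, 1.0, 0.0])
loss = sra_loss(logits, label=0, attention=attention, rationale_mask=mask)
```

Minimizing this objective trades off predictive accuracy against agreement with human rationales, with `lam` controlling how strongly attention is supervised.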