📝 Abstract
The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers, optimizing a joint objective that combines the standard classification loss with an alignment loss that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluate SRA on hate speech benchmarks with rationale annotations in English (HateXplain) and Portuguese (HateBRXplain). Empirically, SRA achieves 2.4× better explainability than current baselines and produces token-level explanations that are more faithful and better aligned with human reasoning. In terms of fairness, SRA is competitive across all measures, ranking second-best at detecting toxic posts targeting identity groups while remaining comparable on the other metrics. These findings demonstrate that incorporating human rationales into attention mechanisms can enhance interpretability and faithfulness without compromising fairness.
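The joint objective described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes an MSE discrepancy between the attention distribution and the normalized rationale mask, and a weighting hyperparameter `lam`; the paper's exact alignment term may differ.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sra_loss(logits, label, attention, rationale_mask, lam=1.0):
    """Illustrative SRA-style joint objective (hypothetical sketch):
    cross-entropy classification loss plus an alignment term that pulls
    the model's attention toward the human rationale distribution."""
    # Classification loss: cross-entropy over the class logits
    probs = softmax(logits)
    cls_loss = -np.log(probs[label])
    # Alignment loss: MSE between attention weights and the rationale
    # mask normalized into a distribution over tokens (assumed form)
    target = rationale_mask / rationale_mask.sum()
    align_loss = np.mean((attention - target) ** 2)
    return cls_loss + lam * align_loss

# Toy usage: 2 classes, 4 tokens, human rationale marks tokens 1-2
logits = np.array([2.0, 0.5])
attention = np.array([0.1, 0.6, 0.2, 0.1])
mask = np.array([0.0, 1.0, 1.0, 0.0])
loss = sra_loss(logits, label=0, attention=attention, rationale_mask=mask)
```

Minimizing this objective trades off predictive accuracy against agreement with human rationales, with `lam` controlling how strongly attention is supervised.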