Self-Explaining Hate Speech Detection with Moral Rationales

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current hate speech detection models, which often rely on superficial lexical cues, suffer from spurious correlations, and lack robustness, cultural adaptability, and interpretability. To overcome these issues, the authors propose a Supervised Moral Rationale Attention (SMRA) mechanism that leverages expert-annotated moral rationales—grounded in Moral Foundations Theory—as direct supervision signals for attention alignment, thereby guiding the model to focus on morally salient text segments within an inherently interpretable framework. Key contributions include the release of HateBRMoralXplain, the first Brazilian Portuguese dataset annotated with token-level moral rationales; a multi-task learning setup combining hate speech detection and moral sentiment classification; and the adoption of IoU F1, Token F1, and sufficiency/fairness metrics to evaluate explanation quality. Experiments demonstrate improvements of 0.9 and 1.5 F1 points in binary hate detection and multi-label moral sentiment classification, respectively, along with significantly more faithful (IoU F1 +7.4, Token F1 +5.0), concise, and sufficient explanations, while maintaining fairness.
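The explanation-quality metrics named above can be sketched as follows. This is a minimal illustration of token-level F1 and span-level IoU F1 in the style commonly used for rationale evaluation (e.g. the ERASER benchmark); the paper's exact matching rules are not given here, and the 0.5 IoU threshold is an assumption.

```python
def token_f1(pred_tokens: set, gold_tokens: set) -> float:
    """F1 over individual rationale token indices."""
    if not pred_tokens and not gold_tokens:
        return 1.0
    tp = len(pred_tokens & gold_tokens)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two [start, end) token spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def iou_f1(pred_spans, gold_spans, threshold=0.5) -> float:
    """A predicted span counts as a hit when it overlaps some gold
    span with IoU >= threshold; F1 is computed over span hits.
    (Simplified: each hit is counted once per predicted span.)"""
    if not pred_spans and not gold_spans:
        return 1.0
    hits = sum(1 for p in pred_spans
               if any(iou(p, g) >= threshold for g in gold_spans))
    if hits == 0:
        return 0.0
    precision = hits / len(pred_spans)
    recall = hits / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted rationale {1, 2, 3} against gold {2, 3, 4} scores a Token F1 of 2/3, while a predicted span (0, 4) against gold (2, 6) has IoU 1/3 and so misses the 0.5 threshold.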

📝 Abstract
Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
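The abstract describes integrating rationale supervision directly into the training objective. A minimal sketch of such a joint objective is shown below, assuming a binary cross-entropy alignment term between per-token attention weights and the 0/1 expert rationale mask, weighted by a coefficient `lam`; the actual divergence and weighting used in the paper are not specified here and these choices are assumptions.

```python
import math


def attention_alignment_loss(attn, rationale_mask, eps=1e-9):
    """Binary cross-entropy between attention weights (values in
    [0, 1]) and the 0/1 expert rationale mask, averaged over tokens."""
    total = 0.0
    for a, r in zip(attn, rationale_mask):
        a = min(max(a, eps), 1 - eps)  # clamp for numerical stability
        total += -(r * math.log(a) + (1 - r) * math.log(1 - a))
    return total / len(attn)


def joint_loss(task_loss, attn, rationale_mask, lam=1.0):
    """Joint objective: classification loss plus the weighted
    attention-alignment term."""
    return task_loss + lam * attention_alignment_loss(attn, rationale_mask)
```

Under this formulation, attention mass concentrated on annotated rationale tokens drives the alignment term toward zero, so gradient descent on the joint loss pushes the model toward morally salient spans while still optimizing the detection task.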
Problem

Research questions and friction points this paper is trying to address.

hate speech detection
spurious correlations
interpretability
cultural contextualization
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

moral rationales
self-explaining models
attention supervision
hate speech detection
interpretable AI