Self-Explaining Hate Speech Detection with Moral Rationales

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current hate speech detection models, which often rely on superficial lexical cues, suffer from spurious correlations, and lack robustness, cultural adaptability, and interpretability. To overcome these issues, the authors propose a Supervised Moral Rationale Attention (SMRA) mechanism that leverages expert-annotated moral rationales—grounded in Moral Foundations Theory—as direct supervision signals for attention alignment, thereby guiding the model to focus on morally salient text segments within an inherently interpretable framework. Key contributions include the release of HateBRMoralXplain, the first Brazilian Portuguese dataset annotated with token-level moral rationales; a multi-task learning setup combining hate speech detection and moral sentiment classification; and the adoption of IoU F1, Token F1, and sufficiency/fairness metrics to evaluate explanation quality. Experiments demonstrate improvements of 0.9 and 1.5 F1 points in binary hate detection and multi-label moral sentiment classification, respectively, along with significantly more faithful (IoU F1 +7.4, Token F1 +5.0), concise, and sufficient explanations, while maintaining fairness.
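The explanation-quality metrics named above can be sketched as follows. This is a minimal illustration of token-level F1 and span-level IoU F1 in the style commonly used for rationale evaluation (e.g. the ERASER benchmark); the paper's exact matching rules are not given here, and the 0.5 IoU threshold is an assumption.

```python
def token_f1(pred_tokens: set, gold_tokens: set) -> float:
    """F1 over individual rationale token indices."""
    if not pred_tokens and not gold_tokens:
        return 1.0
    tp = len(pred_tokens & gold_tokens)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two [start, end) token spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def iou_f1(pred_spans, gold_spans, threshold=0.5) -> float:
    """A predicted span counts as a hit when it overlaps some gold
    span with IoU >= threshold; F1 is computed over span hits.
    (Simplified: each hit is counted once per predicted span.)"""
    if not pred_spans and not gold_spans:
        return 1.0
    hits = sum(1 for p in pred_spans
               if any(iou(p, g) >= threshold for g in gold_spans))
    if hits == 0:
        return 0.0
    precision = hits / len(pred_spans)
    recall = hits / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted rationale {1, 2, 3} against gold {2, 3, 4} scores a Token F1 of 2/3, while a predicted span (0, 4) against gold (2, 6) has IoU 1/3 and so misses the 0.5 threshold.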

📝 Abstract
Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
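The abstract describes integrating rationale supervision directly into the training objective. A minimal sketch of such a joint objective is shown below, assuming a binary cross-entropy alignment term between per-token attention weights and the 0/1 expert rationale mask, weighted by a coefficient `lam`; the actual divergence and weighting used in the paper are not specified here and these choices are assumptions.

```python
import math


def attention_alignment_loss(attn, rationale_mask, eps=1e-9):
    """Binary cross-entropy between attention weights (values in
    [0, 1]) and the 0/1 expert rationale mask, averaged over tokens."""
    total = 0.0
    for a, r in zip(attn, rationale_mask):
        a = min(max(a, eps), 1 - eps)  # clamp for numerical stability
        total += -(r * math.log(a) + (1 - r) * math.log(1 - a))
    return total / len(attn)


def joint_loss(task_loss, attn, rationale_mask, lam=1.0):
    """Joint objective: classification loss plus the weighted
    attention-alignment term."""
    return task_loss + lam * attention_alignment_loss(attn, rationale_mask)
```

Under this formulation, attention mass concentrated on annotated rationale tokens drives the alignment term toward zero, so gradient descent on the joint loss pushes the model toward morally salient spans while still optimizing the detection task.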
Problem

Research questions and friction points this paper is trying to address.

hate speech detection
spurious correlations
interpretability
cultural contextualization
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

moral rationales
self-explaining models
attention supervision
hate speech detection
interpretable AI