MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based moral reasoning evaluation suffers from two key limitations: (1) a lack of interpretable, fine-grained moral annotations, and (2) heavy reliance on English, hindering cross-cultural assessment. To address these, the authors introduce the first multilingual, multi-hop hate speech explanation dataset grounded in Moral Foundations Theory (MFT), covering English, Portuguese, Italian, and Persian. The dataset incorporates text-span-level rationale annotations, enabling fine-grained evaluation along three dimensions: binary hate speech detection, moral foundation classification, and rationale extraction. Experimental results reveal that current LLMs achieve robust performance on hate speech detection (F1 = 0.836) but exhibit severe deficiencies in moral sentiment prediction (F1 < 0.35) and rationale-decision alignment, particularly for low-resource languages, exposing fundamental weaknesses in cross-lingual moral reasoning.
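The rationale-extraction dimension above compares model-highlighted text spans against human span-level annotations. A common way to score such alignment is token-level F1 between gold and predicted spans; the sketch below illustrates that idea. Note this is an illustrative assumption, not the paper's exact evaluation protocol, and the function names (`rationale_f1`, `span_to_tokens`) and the token-index span format are hypothetical.

```python
# Illustrative sketch: token-level F1 for rationale-span alignment.
# Spans are (start, end) token indices, end-exclusive; this format
# is an assumption for illustration, not the dataset's actual schema.

def span_to_tokens(spans, n_tokens):
    """Expand a list of (start, end) token spans into a set of token indices."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, min(end, n_tokens)))
    return covered

def rationale_f1(gold_spans, pred_spans, n_tokens):
    """Token-level F1 between gold and predicted rationale spans."""
    gold = span_to_tokens(gold_spans, n_tokens)
    pred = span_to_tokens(pred_spans, n_tokens)
    if not gold and not pred:
        return 1.0  # both empty: trivially aligned
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: gold rationale covers tokens 2-5, prediction covers tokens 3-7.
score = rationale_f1([(2, 6)], [(3, 8)], n_tokens=10)  # ≈ 0.667
```

A higher score means the model's highlighted evidence overlaps more with human-annotated rationales; the paper's finding of limited rationale alignment in underrepresented languages corresponds to low values of such a metric.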

📝 Abstract
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundations Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited, particularly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' moral reasoning via multilingual hate speech explanations
Addressing lack of moral justification annotations in current benchmarks
Assessing cultural diversity gaps in moral reasoning evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual hate speech dataset with span-level moral annotations
Moral Foundations Theory as the grounding for multi-hop explanations
Evaluation of LLM moral reasoning across diverse languages