🤖 AI Summary
Existing LLM-based moral reasoning evaluation suffers from two key limitations: (1) a lack of interpretable, fine-grained moral annotations, and (2) heavy reliance on English, hindering cross-cultural assessment. To address these, we introduce the first multilingual, multi-hop hate speech explanation dataset grounded in Moral Foundations Theory (MFT), covering English, Portuguese, Italian, and Persian. The dataset incorporates text-span-level rationale annotations, enabling fine-grained evaluation along three dimensions: binary hate speech detection, moral foundation classification, and rationale extraction. Experimental results show that current LLMs perform strongly on hate speech detection (F1 up to 0.836) but fall short on moral sentiment prediction (F1 < 0.35) and on rationale-decision alignment, particularly for low-resource languages, exposing fundamental weaknesses in cross-lingual moral reasoning.
📝 Abstract
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. However, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundations Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited, particularly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
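To make the three evaluation dimensions concrete, the sketch below shows what an annotated record and a rationale-alignment score could look like. The field names, label set, and token-level overlap F1 are illustrative assumptions for exposition, not the released MFTCXplain schema or the paper's official scoring code.

```python
# Illustrative sketch only: field names and the token-level overlap F1 are
# assumptions, not the released dataset schema or the paper's evaluation script.

def rationale_f1(predicted_tokens, gold_tokens):
    """Token-level overlap F1 between a predicted and a human rationale span."""
    pred, gold = set(predicted_tokens), set(gold_tokens)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical annotated tweet covering the three evaluation dimensions:
example = {
    "text": "...",                               # tweet text (placeholder)
    "language": "pt",                            # one of pt, it, fa, en
    "hate_speech": 1,                            # binary hate speech label
    "moral_foundations": ["care", "fairness"],   # MFT categories (assumed label set)
    "rationale_tokens": [3, 4, 5, 9],            # indices of annotated rationale tokens
}

# Comparing a model's predicted rationale against the human annotation:
print(rationale_f1([3, 4, 9, 10], example["rationale_tokens"]))  # -> 0.75
```

Token-level overlap F1 is one common way to score extracted rationales; the metric actually used in the paper may differ.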