🤖 AI Summary
This work addresses the absence of systematic benchmarks for evaluating cross-cultural competence in large language models (LLMs), a gap rooted in the scarcity of high-quality parallel corpora annotated with culture-specific items (CSIs). To bridge it, the authors introduce XCR-Bench, a multitask benchmark that integrates Newmark’s CSI taxonomy and Hall’s triadic cultural framework into LLM evaluation. Comprising 4.9k parallel sentences and 1,098 unique CSIs, the benchmark spans deep cultural dimensions such as social norms, beliefs, and values, enabling both quantitative and qualitative analysis of LLMs’ ability to comprehend and adapt to explicit and implicit cultural elements. Evaluations reveal significant weaknesses in mainstream models on tasks involving social etiquette and cultural references, and uncover persistent regional and ethno-religious biases even within monolingual settings.
📝 Abstract
Cross-cultural competence in large language models (LLMs) requires the ability to identify Culture-Specific Items (CSIs) and to adapt them appropriately across cultural contexts. Progress in evaluating this capability has been constrained by the scarcity of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. To address this limitation, we introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark consisting of 4.9k parallel sentences and 1,098 unique CSIs, spanning three distinct reasoning tasks with corresponding evaluation metrics. Our corpus integrates Newmark's CSI framework with Hall's Triad of Culture, enabling systematic analysis of cultural reasoning beyond surface-level artifacts and into semi-visible and invisible cultural elements such as social norms, beliefs, and values. Our findings show that state-of-the-art LLMs exhibit consistent weaknesses in identifying and adapting CSIs related to social etiquette and cultural references. Additionally, we find evidence that LLMs encode regional and ethno-religious biases during cultural adaptation, even within a single linguistic setting. We release our corpus and code to facilitate future research on cross-cultural NLP.