Do Large Language Models Truly Understand Cross-cultural Differences?

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) genuinely comprehend cross-cultural differences. Method: We introduce SAGE, a theoretically grounded benchmark built on cultural theory, featuring a nine-dimensional competency taxonomy covering 15 real-world cross-cultural scenarios; it integrates 210 core cultural concepts and 4,530 contextualized, generative test items, and it supports multilingual extension. The methodology combines theory-driven concept alignment, generative task design, and standardized item construction. Contribution/Results: Experiments demonstrate SAGE's strong cross-lingual transferability and, for the first time, systematically expose LLMs' pervasive deficiencies in deep cultural reasoning, including value trade-off analysis and contextual metaphor interpretation. As the first open-source, scalable, and theory-anchored evaluation framework for cross-cultural AI capabilities, SAGE provides a rigorous foundation for diagnosing model limitations and guiding targeted improvements in culturally intelligent language modeling.


📝 Abstract
In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given their wide range of applications, cross-cultural understanding is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited assessment of deep cultural reasoning. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and constructed 4,530 test items across 15 specific real-world scenarios, organized under four broader categories of cross-cultural situations, following established item design principles. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs are still some distance away from a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include data and code in the supplementary materials. In future versions, we will make them publicly available online.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' cross-cultural understanding and reasoning
Addresses limitations in existing benchmarks for cultural assessment
Proposes SAGE benchmark with scenario-based test items
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scenario-based benchmark for cross-cultural evaluation
Core concept alignment across cultural dimensions
Generative task design for reasoning assessment
Shiwei Guo
School of Foreign Languages and Literatures, Fudan University
Sihang Jiang
Fudan University
Knowledge Graph, Large Language Models
Qianxi He
School of Computer Science and Technology, Fudan University
Yanghua Xiao
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Jiaqing Liang
Fudan University
Knowledge Graph, Deep Learning
Bi Yude
School of Foreign Languages and Literatures, Fudan University
Minggui He
Huawei, China
Shimin Tao
2012 Lab, Huawei Co., Ltd.
Machine Translation, AIOps, Log Analysis
Li Zhang
Huawei, China