CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

2026-04-21
Citations: 0
Influential: 0

AI Summary
This study addresses the limitations of existing large language model (LLM) evaluation benchmarks, which predominantly focus on general linguistic comprehension or superficial cultural knowledge and thus fail to assess deep multilingual and multicultural competencies in authentic contexts. To bridge this gap, the authors propose CulturALL, a novel benchmark centered on contextualized tasks. Developed through collaborative efforts between human experts and LLMs, CulturALL comprises 2,610 challenging samples spanning 14 languages, 51 regions, and 16 cultural themes. Emphasizing both factual accuracy and cultural sensitivity, the benchmark reveals significant shortcomings in current models, with the best-performing system achieving only 44.48% accuracy. CulturALL thereby provides a critical tool for advancing research on cross-cultural understanding in artificial intelligence.

πŸ“ Abstract
Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human--AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.
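As a rough illustration of how an accuracy figure like the one reported above can be broken down by language, region, or topic, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Sample` record, its field names, the placeholder data, and the normalized exact-match scoring rule are hypothetical and are not taken from the CulturALL release, whose actual data format and scoring procedure are not described on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record layout; the real benchmark schema may differ.
    language: str
    region: str
    topic: str
    gold: str
    prediction: str


def is_correct(sample: Sample) -> bool:
    # Assumption: case- and whitespace-insensitive exact match.
    return sample.prediction.strip().lower() == sample.gold.strip().lower()


def overall_accuracy(samples: list[Sample]) -> float:
    # Fraction of samples whose prediction matches the gold answer.
    return sum(is_correct(s) for s in samples) / len(samples)


def accuracy_by(samples: list[Sample], key: str) -> dict[str, float]:
    # Group by one attribute ("language", "region", or "topic")
    # and compute the accuracy inside each group.
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for s in samples:
        group = getattr(s, key)
        total[group] += 1
        correct[group] += int(is_correct(s))
    return {g: correct[g] / total[g] for g in total}


# Placeholder data only; not real benchmark items or model outputs.
samples = [
    Sample("lang_a", "region_1", "topic_1", "A", "A"),
    Sample("lang_a", "region_1", "topic_2", "B", "C"),
    Sample("lang_b", "region_2", "topic_1", "C", " c "),
    Sample("lang_b", "region_2", "topic_2", "D", "A"),
    Sample("lang_b", "region_3", "topic_2", "A", "a"),
]

print(overall_accuracy(samples))
print(accuracy_by(samples, "language"))
```

A per-group breakdown of this kind makes it easier to see whether a low overall score is spread evenly or concentrated in particular languages or topics.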
Problem

Research questions and friction points this paper is trying to address.

multilingual
multicultural
grounded tasks
large language models
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

grounded tasks
multilingual benchmark
multicultural competence
human-AI collaboration
large language models
Peiqin Lin
LMU Munich
Natural Language Processing, Multilinguality, Language Modeling, Sentiment Analysis
Chenyang Lyu
Alibaba
Large Language Models, Natural Language Processing, Machine Learning
Wenjiang Luo
Beijing Language and Culture University
Haotian Ye
Computer Science Ph.D. at Stanford University
Md Mehrab Hossain
University of Turku
Chunlan Ma
LMU Munich
Shaoxiong Ji
Technical University of Darmstadt
Machine Learning, Natural Language Processing, Health Informatics
Younes Samih
IBM Research AI, IBM
LLMs, NLP, Arabic NLP
Bo Zeng
University of Pittsburgh
Fan Jiang
Alibaba Group
Yuanbin Cao
Alibaba Group
Dilda Duisenbek
Beijing Language and Culture University
Adrian Neo Sau Xun
Beijing Language and Culture University
Daria Pozdniakova
Beijing Language and Culture University
Liubou Misevich
Beijing Language and Culture University
Nevena Marinković
Beijing Language and Culture University
Ngoc Gia Linh Nguyen
Beijing Language and Culture University
Thi Khanh Linh Do
Beijing Language and Culture University
Sarakmatak Sophy
Beijing Language and Culture University
Baotian Hu
Harbin Institute of Technology (Shenzhen)
LLM, MLLM, NLP
Guanhua Chen
Assistant Professor, Southern University of Science and Technology
Reasoning LLMs, Data Synthesis, Multimodal
Gongbo Tang
Beijing Language and Culture University
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Longyue Wang
Alibaba International
Large Language Model, Machine Translation, Natural Language Processing, Language Agent
Weihua Luo
Alibaba
natural language processing, machine learning, artificial intelligence