CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limitations of existing large language model (LLM) evaluation benchmarks, which predominantly focus on general linguistic comprehension or superficial cultural knowledge and thus fail to assess deep multilingual and multicultural competencies in authentic contexts. To bridge this gap, the authors propose CulturALL, a novel benchmark centered on grounded tasks set in real-world, context-rich scenarios. Developed through collaborative efforts between human experts and LLMs, CulturALL comprises 2,610 challenging samples spanning 14 languages, 51 regions, and 16 cultural themes. Emphasizing both factual accuracy and cultural sensitivity, the benchmark reveals significant shortcomings in current models, with the best-performing system achieving only 44.48% accuracy. CulturALL thereby provides a critical tool for advancing research on cross-cultural understanding in artificial intelligence.

📝 Abstract

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks, where models must reason within real-world, context-rich scenarios, largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human-AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.

Problem

Research questions and friction points this paper is trying to address.

multilingual

multicultural

grounded tasks

large language models

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

grounded tasks

multilingual benchmark

multicultural competence