🤖 AI Summary
Existing LLM cultural understanding evaluation benchmarks suffer from weak theoretical foundations, poor cross-cultural scalability, and heavy reliance on manual annotation.
Method: Grounded in the cultural iceberg theory, we propose the first systematic and scalable cultural understanding evaluation framework, organized into 3 layers and 140 dimensions. It combines a fine-grained, theory-driven dimensional schema, automated multilingual knowledge base construction, and synthetic benchmark data generation, built on cultural dimension modeling, large-scale knowledge extraction, and LLM-based evaluation techniques.
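The pipeline described above (schema-guided knowledge extraction followed by benchmark generation) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the layer names, example dimensions, `KnowledgeEntry` structure, and question template are all hypothetical stand-ins for the actual 3-layer, 140-dimension schema and LLM-backed extraction.

```python
# Hypothetical sketch of a CultureScope-style pipeline.
# Layers/dimensions below are illustrative placeholders, not the real taxonomy.
from dataclasses import dataclass

# Three iceberg layers, each mapped to a few example dimensions.
SCHEMA = {
    "surface": ["cuisine", "festivals"],
    "intermediate": ["etiquette", "family_roles"],
    "deep": ["values", "taboos"],
}

@dataclass
class KnowledgeEntry:
    culture: str
    layer: str
    dimension: str
    fact: str

def build_knowledge_base(culture, extract):
    """Walk the schema and call an extractor (LLM-backed in practice) per dimension."""
    kb = []
    for layer, dims in SCHEMA.items():
        for dim in dims:
            for fact in extract(culture, layer, dim):
                kb.append(KnowledgeEntry(culture, layer, dim, fact))
    return kb

def to_eval_item(entry):
    """Turn one knowledge entry into a benchmark question (template is a placeholder)."""
    return {
        "question": f"In {entry.culture} culture ({entry.dimension}): {entry.fact}?",
        "layer": entry.layer,
    }

# Stub standing in for large-scale LLM knowledge extraction.
def stub_extract(culture, layer, dim):
    return [f"a sample {dim} fact"]
```

The point of the sketch is the scalability argument: because the schema drives both extraction and question generation, adding a new culture only requires swapping the `culture` argument, not new manual annotation.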
Contribution/Results: Experiments reveal that mainstream LLMs lack comprehensive cultural competence: despite multilingual training, they show no significant gains in deep cultural understanding. The benchmark covers diverse cultural contexts and empirically exposes widespread deficits in implicit cultural cognition across models. This work establishes the first theoretically grounded, large-scale, and reproducible evaluation benchmark for cultural alignment research.
📝 Abstract
As large language models (LLMs) are increasingly deployed in diverse cultural environments, evaluating their cultural understanding capability has become essential for ensuring trustworthy and culturally aligned applications. However, most existing benchmarks lack comprehensiveness and are difficult to scale and adapt across cultural contexts, because their frameworks often lack guidance from well-established cultural theories and tend to rely on expert-driven manual annotation. To address these issues, we propose CultureScope, the most comprehensive evaluation framework to date for assessing cultural understanding in LLMs. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification, comprising 3 layers and 140 dimensions, which guides the automated construction of culture-specific knowledge bases and corresponding evaluation datasets for any given language and culture. Experimental results demonstrate that our method can effectively evaluate cultural understanding. They also reveal that existing large language models lack comprehensive cultural competence, and that merely incorporating multilingual data does not necessarily enhance cultural understanding. All code and data files are available at https://github.com/HoganZinger/Culture