🤖 AI Summary
Current cultural evaluation benchmarks predominantly reduce culture to static facts or homogenized values, neglecting its dynamism, historical situatedness, and embeddedness in practice. This view contradicts foundational anthropological principles. Method: The paper integrates anthropological theory to construct a four-dimensional cultural evaluation framework and, through qualitative analysis of 20 existing benchmarks, systematically identifies six methodological flaws (e.g., the "nation-as-culture" fallacy and the erasure of intracultural diversity). Contribution/Results: The study proposes a tripartite improvement pathway centered on authentic contextual narratives, community-involved benchmark design, and practice-oriented assessment. It moves cultural evaluation from memory-based factual recall toward situated, responsive practice, establishing both a theoretical foundation and an actionable paradigm for developing more authentic, pluralistic, and dynamic cultural assessment systems.
📝 Abstract
Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture: as knowledge, as preference, as performance, or as bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose three concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than in isolation. Our aim is to guide the development of cultural benchmarks that move beyond static recall tasks and more accurately capture how models respond to complex cultural situations.