🤖 AI Summary
This study addresses a critical gap in the cultural evaluation of large language models (LLMs): prior work has emphasized diversity and factual accuracy while overlooking how local populations themselves perceive and prioritize their cultural values. To bridge this gap, the authors propose a human-centered evaluation framework that constructs “cultural importance vectors” from open-ended survey responses across nine countries to serve as human benchmarks. They design a syntactically diverse prompt set to elicit corresponding “cultural representation vectors” from three state-of-the-art LLMs and quantify alignment between model outputs and local cultural expectations through vector similarity. By introducing this cultural importance–representation alignment measure, the approach moves beyond the limitations of conventional diversity and accuracy metrics. Empirical results reveal a pervasive Western-centric bias: alignment decreases as a country’s cultural distance from the United States grows, and all models share highly correlated error signatures (ρ > 0.97), overemphasizing superficial cultural symbols while underrepresenting deeper societal values.
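A minimal sketch of the core alignment computation, assuming the vectors are per-facet frequency distributions and that “vector similarity” means cosine similarity (the summary does not pin either down; the facet set and weights below are purely illustrative):

```python
import numpy as np

def cosine_alignment(importance: np.ndarray, representation: np.ndarray) -> float:
    """Cosine similarity between a human-derived cultural importance vector
    and a model-derived cultural representation vector over the same facets."""
    denom = np.linalg.norm(importance) * np.linalg.norm(representation)
    return float(np.dot(importance, representation) / denom) if denom else 0.0

# Hypothetical facet ordering for one country; the real vectors are induced
# from coded open-ended survey responses (human) and prompted model outputs (LLM).
facets = ["food", "festivals", "family_values", "religion", "work_ethic"]
human_importance = np.array([0.15, 0.10, 0.35, 0.20, 0.20])
model_representation = np.array([0.40, 0.30, 0.10, 0.10, 0.10])

print(f"alignment = {cosine_alignment(human_importance, model_representation):.3f}")
```

A low score here would reflect the pattern the study reports: the model’s output mass concentrates on surface symbols (food, festivals) rather than the values respondents rank highest.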
📝 Abstract
Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors, built from an induced set of culturally significant facets drawn from open-ended survey responses collected across nine countries. Next, we introduce a method to compute the model-derived Cultural Representation Vectors of an LLM from a syntactically diversified prompt set, and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance Vectors and the model-derived Cultural Representation Vectors reveals a Western-centric calibration in some models: alignment decreases as a country's cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures ($\rho > 0.97$) across all models, which over-index on certain cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity with which AI-generated content captures the nuanced hierarchies of global cultures.
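For the reported cross-model error correlation, the following sketch assumes an “error signature” is the per-facet gap between representation and importance, compared across models with a Spearman rank correlation; neither definition is given in the abstract, and all numbers are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def error_signature(representation: np.ndarray, importance: np.ndarray) -> np.ndarray:
    """Per-facet over-/under-representation; positive entries are over-indexed."""
    return representation - importance

# Hypothetical per-facet vectors for two models against one human baseline.
importance = np.array([0.15, 0.10, 0.35, 0.20, 0.20])
model_a = np.array([0.40, 0.30, 0.10, 0.10, 0.10])
model_b = np.array([0.38, 0.28, 0.12, 0.12, 0.10])

rho, p = spearmanr(error_signature(model_a, importance),
                   error_signature(model_b, importance))
print(f"Spearman rho between error signatures: {rho:.3f} (p = {p:.3g})")
```

Highly correlated signatures (the paper reports $\rho > 0.97$) indicate that the models share a common bias pattern rather than making independent mistakes.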