🤖 AI Summary
This paper identifies key limitations of closed-ended multiple-choice assessments for evaluating the cultural alignment of LLMs: such evaluations are highly sensitive to minor design choices, such as the ordering of answer options, and may not reflect how models express cultural values when responses are not forced. Using World Values Survey (WVS) items and Hofstede's Cultural Dimensions as case studies, the authors compare constrained multiple-choice prompting with less constrained, open-ended elicitation. Empirically, LLMs exhibit stronger cultural alignment in unconstrained generation settings than in closed-choice ones, while small perturbations such as reordering survey choices produce inconsistent closed-style outputs, exposing the fragility of that evaluation paradigm. Based on these findings, the paper advocates for more robust and flexible evaluation frameworks centered on specific cultural proxies, supporting more nuanced and accurate assessments of cultural alignment in LLMs.
📝 Abstract
A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs). In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches. Using the World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, we demonstrate that LLMs exhibit stronger cultural alignment in less constrained settings, where responses are not forced. Additionally, we show that even minor changes, such as reordering survey choices, lead to inconsistent outputs, exposing the limitations of closed-style evaluations. Our findings advocate for more robust and flexible evaluation frameworks that focus on specific cultural proxies, encouraging more nuanced and accurate assessments of cultural alignment in LLMs.
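To make the order-sensitivity issue concrete, below is a minimal sketch of the kind of probe the abstract alludes to: shuffling the answer options of a single WVS-style item and checking whether a model keeps selecting the same underlying option. This is not the authors' code; the item wording, the `ask_model` placeholder, and the majority-agreement consistency score are all illustrative assumptions.

```python
import random
from collections import Counter

# Hypothetical WVS-style item; wording and options are illustrative, not from the paper.
QUESTION = "How important is family in your life?"
OPTIONS = ["Very important", "Rather important", "Not very important", "Not at all important"]


def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM call (swap in whatever client you use).

    It should return the letter of the chosen option ('A', 'B', ...).
    Here it returns a fixed letter so the script runs without a model.
    """
    return "A"


def build_prompt(question: str, options: list[str]) -> str:
    """Render the item as a closed-style multiple-choice prompt."""
    letters = "ABCD"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return f"{question}\nAnswer with a single letter.\n" + "\n".join(lines)


def chosen_option(question: str, options: list[str]) -> str:
    """Map the model's letter back to the option text, so different orderings are comparable."""
    letter = ask_model(build_prompt(question, options)).strip()[:1].upper()
    return options["ABCD".index(letter)]


def order_consistency(question: str, options: list[str], n_orderings: int = 10) -> float:
    """Fraction of shuffled orderings whose answer matches the most common answer."""
    answers = []
    for _ in range(n_orderings):
        shuffled = options[:]
        random.shuffle(shuffled)
        answers.append(chosen_option(question, shuffled))
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


if __name__ == "__main__":
    print(f"Order consistency: {order_consistency(QUESTION, OPTIONS):.2f}")
```

A score near 1.0 means the stated preference is stable under reordering; the paper's observation is that closed-style prompts often fail to show this stability, which is part of the case for less constrained, proxy-focused evaluation.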