🤖 AI Summary
This study investigates large language models' (LLMs) capacity to model cross-cultural differences in moral values. Method: LLM outputs are systematically evaluated against authoritative human moral-attitude datasets (the World Values Survey and the Pew Global Attitudes Project) using a novel log-probability-based moral justifiability scoring framework that enables culture-sensitive, quantitative assessment of moral judgments across countries and ethical topics. Contribution/Results: Experiments span more than ten models, from GPT-2, OPT, and BLOOMZ to Qwen, GPT-4o, Gemma-2, and Llama-3.3, and reveal that instruction tuning matters more than parameter scaling alone: advanced instruction-tuned models (e.g., GPT-4o) exhibit significant positive correlations (up to r = 0.62) with human survey responses across most ethical topics, whereas earlier, smaller models show near-zero or negative correlations. These findings establish instruction tuning as a critical pathway for improving the cultural alignment of LLMs' moral reasoning.
📝 Abstract
Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs can mirror variations in moral attitudes reported by two major cross-cultural surveys: the World Values Survey and the Pew Research Center's Global Attitudes Survey. We compare smaller, monolingual, and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with more recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models (including GPT-4o and GPT-4o-mini) achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. While scaling up model size and using instruction tuning can improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, and strategies for improving the cultural sensitivity of LLMs.