🤖 AI Summary
This study investigates large language models' (LLMs) capacity to model cross-cultural differences in moral values. Method: LLM outputs are systematically evaluated against authoritative human moral-attitude datasets (the World Values Survey and the Pew Global Attitudes Project) using a novel log-probability-based moral justifiability scoring framework that enables culture-sensitive, quantitative assessment of moral judgments across countries and ethical topics. Contribution/Results: Experiments span more than ten models, from GPT-2, OPT, and BLOOMZ to Qwen, GPT-4o, Gemma-2, and Llama-3.3, and reveal that instruction tuning matters more than parameter scaling alone: advanced instruction-tuned models (e.g., GPT-4o) exhibit significant positive correlations (up to r = 0.62) with human survey responses across most ethical topics, whereas earlier, smaller models show near-zero or negative correlations. These findings establish instruction tuning as a critical pathway for improving the cultural alignment of LLMs' moral reasoning.
📝 Abstract
Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs can mirror variations in moral attitudes reported by two major cross-cultural surveys: the World Values Survey and the Pew Research Center's Global Attitudes Survey. We compare smaller, monolingual, and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with more recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models (including GPT-4o and GPT-4o-mini) achieve substantially higher positive correlations, suggesting they better reflect real-world moral attitudes. While scaling up model size and using instruction tuning can improve alignment with cross-cultural moral norms, challenges remain for certain topics and regions. We discuss these findings in relation to bias analysis, training data diversity, and strategies for improving the cultural sensitivity of LLMs.