🤖 AI Summary
This study reveals implicit cultural biases in GPT-4's reasoning about cross-cultural social norms: a tendency to generate overgeneralized, low-specificity norms, and an ease with which minimal prompting reactivates stereotypical associations.
Method: Moving beyond conventional values-based surveys, the study introduces a "bottom-up", narrative-driven evaluation paradigm grounded in multicultural scenario cases, comprehension-oriented prompt engineering, and a mixed qualitative–quantitative bias-detection framework.
Contribution/Results: Although GPT-4 avoids overtly discriminatory language, it perpetuates cultural bias through concealment rather than elimination: its normative outputs systematically attenuate geographical and group-specific nuance, and these bias patterns are statistically robust and reproducible. The work contributes both a methodological innovation, a context-sensitive narrative-based assessment protocol, and an empirical benchmark for evaluating cultural fairness in large language models.
📝 Abstract
Large language models (LLMs) have been shown to align with the values of Western or North American cultures. Prior work demonstrated this effect mostly through surveys that directly ask (originally people, and now also LLMs) about their values. However, it does not follow that LLMs would consistently apply those stated values in real-world scenarios. To address this, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than eliminated from the model, and they can easily be recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.
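As a concrete illustration of the bottom-up elicitation described above, the sketch below shows one way such a probe could be run: the model is given a culturally situated narrative and asked to articulate the social norm at play, so that the culture-specificity of its answers can then be compared across cultures. This is a minimal sketch assuming the OpenAI Python SDK; the prompt wording, the helper name `elicit_norm`, and the example narrative are hypothetical placeholders, not the paper's actual evaluation materials.

```python
# Minimal sketch of a narrative-based norm probe (illustrative only).
# Assumes the OpenAI Python SDK (openai >= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def elicit_norm(narrative: str, model: str = "gpt-4") -> str:
    """Ask the model to state the social norm underlying a short narrative."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You reason about social norms in short stories."},
            {"role": "user",
             "content": (
                 "Read the following story and state, in one sentence, the social "
                 "norm that explains the characters' behavior. Be as specific to "
                 "the story's cultural context as the story allows.\n\n" + narrative
             )},
        ],
        temperature=0,  # deterministic output makes repeated probes comparable
    )
    return response.choices[0].message.content

# Hypothetical usage: run the same probe on narratives from many cultures and
# compare how culture-specific (vs. overgeneralized) the returned norms are.
story = ("At a family dinner in Seoul, the youngest person at the table waited "
         "for her grandmother to lift her spoon before starting to eat.")
print(elicit_norm(story))
```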