Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models

šŸ“… 2025-08-12
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
This study investigates how large language models (LLMs) overgeneralize low-resource cultures in their internal representations as a result of training-data biases, leading to distorted cultural understanding. To address this, the authors propose CultureScope, the first mechanistic interpretability-based framework for cultural bias analysis. It constructs a cultural knowledge space via representation patching, defines a "cultural flattening score" to quantify bias severity, and traces how these biases emerge across model layers. Experiments reveal that LLMs' cultural knowledge spaces exhibit pronounced Western dominance and systematic cultural flattening. Counterintuitively, low-resource cultures, despite their sparse representations, show lower sensitivity to these biases. The work is the first systematic application of mechanistic interpretability to cultural bias research and establishes a new paradigm for evaluating and improving models' cultural robustness.
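
The paper's exact formula for the cultural flattening score is not given here, but one natural reading is a measure of how interchangeable different cultures' internal representations are. The sketch below is a hypothetical operationalization, not the authors' definition: the `flattening_score` helper and the choice of mean-pooled hidden states are assumptions. It scores a set of per-culture representation vectors by their average pairwise cosine similarity, so a value near 1 suggests distinct cultures have collapsed onto nearly the same representation.

```python
# Hypothetical sketch of a flattening-style metric, NOT the paper's exact definition:
# it treats "flattening" as high average pairwise similarity between the hidden
# representations a model assigns to different cultures.
import torch
import torch.nn.functional as F

def flattening_score(culture_reprs: dict[str, torch.Tensor]) -> float:
    """Average pairwise cosine similarity across per-culture representation vectors.

    `culture_reprs` maps a culture name to a 1-D hidden-state vector, e.g. the
    mean residual-stream activation over prompts about that culture (an assumption).
    A score near 1.0 means distinct cultures collapse onto one generic representation.
    """
    vecs = torch.stack(list(culture_reprs.values()))   # (C, d)
    vecs = F.normalize(vecs, dim=-1)                    # unit-norm rows
    sims = vecs @ vecs.T                                # (C, C) cosine-similarity matrix
    c = vecs.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()       # drop self-similarity terms
    return (off_diag / (c * (c - 1))).item()

# Toy usage with random vectors; real inputs would come from a model's hidden states.
reprs = {name: torch.randn(768) for name in ("japanese", "yoruba", "danish")}
print(flattening_score(reprs))
```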

šŸ“ Abstract
The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs' representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs' cultural competence, without accounting for how LLMs' internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose CultureScope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope uses a patching method to extract this cultural knowledge. We introduce a cultural flattening score as a measure of intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs' cultural understanding. The code and data used in our experiments are publicly available.
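
The abstract names a patching method but gives no implementation details, so the following is only a minimal sketch of generic activation (representation) patching with Hugging Face Transformers; the model name, layer index, hook placement, and prompts are illustrative assumptions rather than the CultureScope setup. A hidden state cached from a culture-specific prompt is written into the same layer during a run on a culture-neutral prompt, and the resulting logits indicate what cultural knowledge that representation carries.

```python
# Minimal sketch of generic activation (representation) patching with Hugging Face
# Transformers. The model, layer index, and prompts are illustrative assumptions,
# not the CultureScope configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; GPT-2-style models expose blocks at model.transformer.h
LAYER = 6             # which transformer block's output to patch (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Cache the residual-stream output of block LAYER at the final token of `prompt`."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    # hidden_states[0] is the embedding layer, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1].clone()

def patched_logits(prompt: str, donor_vec: torch.Tensor) -> torch.Tensor:
    """Run `prompt` while overwriting block LAYER's final-token output with `donor_vec`."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[0, -1] = donor_vec          # in-place patch with the cached donor activation
        return output
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        with torch.no_grad():
            out = model(**tok(prompt, return_tensors="pt"))
    finally:
        handle.remove()
    return out.logits[0, -1]

# Patch a representation from a culture-specific prompt into a culture-neutral one;
# comparing these logits with an unpatched run shows what the donor activation carries.
donor = last_token_hidden("A traditional wedding ceremony in Japan usually includes")
logits = patched_logits("A traditional wedding ceremony usually includes", donor)
```
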
Problem

Research questions and friction points this paper is trying to address.

Investigates cultural biases in large language models' internal representations
Measures intrinsic cultural biases using a cultural flattening score
Traces emergence of Western-dominance bias and cultural flattening in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probes LLMs' internal cultural representations with mechanistic interpretability
Introduces a cultural flattening score to measure intrinsic cultural biases
Extracts the cultural knowledge space with a representation patching method
šŸ”Ž Similar Papers
No similar papers found.