🤖 AI Summary
Large language models (LLMs) exhibit pervasive cultural biases and insufficient understanding of minority cultures, yet the underlying neural mechanisms remain poorly understood. To address this, we propose a gradient-based scoring and iterative filtering method to localize culture-relevant neurons at the single-neuron level. Our analysis identifies fewer than 1% of neurons—concentrated in shallow-to-mid MLP layers—as critical for cultural understanding. We discover a hierarchical organization: culturally general and culturally specific neurons coexist, with the latter demonstrating cross-cultural generalization capability. Ablation experiments show that suppressing these neurons degrades performance on cultural benchmarks by up to 30%, while preserving core linguistic capabilities. Furthermore, our findings suggest that standard training procedures may inadvertently erode cultural understanding. To support reproducible, interpretable, and intervention-aware optimization of cultural fairness, we publicly release our codebase. This work provides the first neuron-level mechanistic account of cultural cognition in LLMs and establishes a foundation for targeted, explainable interventions.
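The gradient-based neuron scoring described above can be sketched roughly as follows. This is an illustrative toy, not the paper's released code: the tiny MLP block, the synthetic data, and the `k=5` cutoff are all placeholder assumptions; the idea shown is simply scoring each MLP hidden unit by the mean magnitude of activation × gradient under a task loss, then keeping the top-scoring units as candidate culture-relevant neurons.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyMLPBlock(nn.Module):
    """Stand-in for one transformer MLP block (hypothetical sizes)."""
    def __init__(self, d_model=16, d_hidden=64):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = self.act(self.up(x))
        h.retain_grad()          # keep gradients on this non-leaf tensor
        self.last_hidden = h     # cache activations for attribution
        return self.down(h)

model = TinyMLPBlock()
x = torch.randn(8, 16)          # stand-in for a batch of hidden states
target = torch.randn(8, 16)     # stand-in for a cultural-task objective

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Per-neuron importance: batch mean of |activation * gradient|
scores = (model.last_hidden * model.last_hidden.grad).abs().mean(dim=0)
top_k = torch.topk(scores, k=5).indices  # candidate culture-relevant neurons
print(top_k.tolist())
```

In the paper's setting this scoring would be followed by the iterative filtering step to refine the candidate set; that refinement loop is omitted here.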
📝 Abstract
As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is essential. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, and the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify both culture-general neurons, which contribute to cultural understanding regardless of the culture involved, and culture-specific neurons, which are tied to an individual culture. These neurons account for less than 1% of all neurons and are concentrated in shallow-to-middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge not only of the target culture but also of related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at https://github.com/ynklab/CULNIG
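The suppression experiment can be mimicked with a forward hook that zeroes a chosen set of hidden units. This is a minimal sketch, not the paper's implementation: the small MLP, the random inputs, and the neuron indices in `neurons_to_ablate` are hypothetical placeholders for the identified culture neurons.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in MLP block; in practice the hook would target a transformer layer.
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
neurons_to_ablate = [3, 17, 42]  # hypothetical culture-neuron indices

def zero_neurons(module, inputs, output):
    # Returning a modified tensor from a forward hook replaces the output.
    output = output.clone()
    output[..., neurons_to_ablate] = 0.0  # suppress the selected hidden units
    return output

x = torch.randn(4, 16)
baseline = mlp(x)

handle = mlp[1].register_forward_hook(zero_neurons)  # hook after the GELU
ablated = mlp(x)
handle.remove()

# With the hook active, the forward pass diverges from the baseline.
print(torch.allclose(baseline, ablated))
```

In the paper's protocol, the ablated model would then be re-evaluated on cultural and NLU benchmarks to measure the selective performance drop.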