AI Summary
Large language models (LLMs) implicitly encode social values that may induce bias and harmful behavior, yet their underlying neural mechanisms remain poorly understood.
Method: We propose ValueExploration, the first neuron-level interpretability framework for decoding social values, integrating activation-difference localization, targeted neuron ablation, and cross-model comparative analysis. We introduce C-voice, a bilingual evaluation benchmark centered on Chinese social values.
Results: We identify cross-model-stable "value-encoding neuron clusters"; ablating these neurons reduces value consistency by 38.6% on average. The framework demonstrates consistent efficacy across four major LLM families. This work bridges the critical gap between behavioral value alignment and neural-level attribution, establishing a novel paradigm for value-aware model interpretability and alignment.
Abstract
Despite their impressive performance, large language models (LLMs) can exhibit unintended biases and harmful behaviors driven by the values they encode, which makes it urgent to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety; it lacks interpretability and fails to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which explores the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. Leveraging C-voice, we then locate the neurons responsible for encoding these values based on activation differences. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be made available.
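The locate-then-deactivate procedure described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation. It assumes the activation difference is taken between value-related and neutral prompts, that neurons are ranked by absolute difference of mean activations, and that deactivation means zeroing the activation; all function names and numbers are hypothetical.

```python
from statistics import mean


def mean_activations(activations):
    """Per-neuron mean over a list of activation vectors (one vector per prompt)."""
    return [mean(column) for column in zip(*activations)]


def locate_value_neurons(value_acts, neutral_acts, top_k):
    """Return the indices of the top_k neurons by absolute mean activation difference.

    Assumption: the contrast is value-related prompts vs. neutral prompts.
    """
    value_mean = mean_activations(value_acts)
    neutral_mean = mean_activations(neutral_acts)
    diffs = [abs(v - n) for v, n in zip(value_mean, neutral_mean)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:top_k])


def ablate(activation, neuron_ids):
    """Deactivate the selected neurons by zeroing them (assumed ablation scheme)."""
    masked = set(neuron_ids)
    return [0.0 if i in masked else a for i, a in enumerate(activation)]


# Toy data: 4 neurons, 3 prompts per condition (hypothetical values).
value_acts = [[0.9, 0.1, 0.8, 0.2], [1.0, 0.2, 0.7, 0.1], [0.8, 0.1, 0.9, 0.2]]
neutral_acts = [[0.1, 0.1, 0.2, 0.2], [0.2, 0.2, 0.1, 0.1], [0.1, 0.1, 0.2, 0.2]]

value_neurons = locate_value_neurons(value_acts, neutral_acts, top_k=2)
ablated = ablate(value_acts[0], value_neurons)
print(value_neurons, ablated)
```

In a real model, the activation vectors would come from hooks on the layers under study, and the effect of ablation would be measured by comparing benchmark responses before and after masking.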