Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

📅 2025-04-07

🤖 AI Summary

Problem: Large language models (LLMs) implicitly encode social values that may induce bias and harmful behavior, yet the neural mechanisms behind these values remain poorly understood. Method: We propose ValueExploration, the first neuron-level interpretability framework for decoding social values. It combines activation-difference localization, targeted neuron ablation, and cross-model comparative analysis. We also introduce C-voice, a bilingual evaluation benchmark centered on Chinese social values. Results: We identify “value-encoding neuron clusters” that are stable across models; ablating these neurons reduces value consistency by 38.6% on average. The framework is effective across four major LLM families. This work connects behavioral value alignment with neuron-level attribution and offers a basis for value-aware model interpretability and alignment.

📝 Abstract

Despite their impressive performance, large language models (LLMs) can exhibit unintended biases and harmful behaviors driven by the values they encode, which makes it urgent to understand the mechanisms behind those values. However, current research primarily evaluates these values through external responses with a focus on AI safety; it lacks interpretability and does not assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which explores, at the neuron level, the mechanisms by which National Social Values drive behavior in LLMs. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. Using C-voice, we then identify and locate the neurons responsible for encoding these values according to their activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be made available.
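
The abstract describes locating neurons "according to activation difference" but does not give the exact scoring rule. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes per-neuron activations have already been recorded for value-related and neutral prompts, and ranks neurons by the absolute difference of their mean activations. The function name `top_value_neurons`, the synthetic data, and the planted neuron indices are all hypothetical.

```python
import numpy as np


def top_value_neurons(acts_value, acts_neutral, k):
    """Return indices of the k neurons with the largest absolute
    difference in mean activation between the two prompt sets.

    Both inputs have shape (n_prompts, n_neurons). This scoring rule
    is an assumption made for illustration only.
    """
    diff = acts_value.mean(axis=0) - acts_neutral.mean(axis=0)
    order = np.argsort(-np.abs(diff))  # largest |difference| first
    return order[:k], diff


# Synthetic stand-in for recorded activations (hypothetical data).
rng = np.random.default_rng(0)
n_prompts, n_neurons = 200, 64
acts_neutral = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
acts_value = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))

# Plant a strong shift in three neurons so there is something to find.
planted = [3, 17, 42]
acts_value[:, planted] += 3.0

top_idx, diff = top_value_neurons(acts_value, acts_neutral, k=3)
print(sorted(top_idx.tolist()))
```

In a real setting the activations would come from forward passes over benchmark prompts, and the choice of layer, statistic, and threshold would need to follow the paper once the code is released.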

Problem

Research questions and friction points this paper is trying to address.

Understanding neural mechanisms behind value-driven behaviors in LLMs

Assessing social values in LLMs with interpretability and real-world context

Identifying and deactivating neurons encoding harmful biases in LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes ValueExploration framework for neuron-level analysis

Uses C-voice benchmark to identify value-encoding neurons

Deactivates neurons to study value-driven behavior shifts
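
The deactivation step can be pictured with a toy network. The sketch below is an illustration under stated assumptions, not the paper's procedure: it uses a small random two-layer ReLU network in NumPy, treats "deactivation" as setting the selected hidden activations to zero, and measures the resulting change in outputs. The function `forward`, the weights, and the ablated indices are hypothetical.

```python
import numpy as np


def forward(x, w1, w2, ablate=None):
    """Two-layer ReLU network; optionally zero selected hidden neurons."""
    h = np.maximum(x @ w1, 0.0)  # hidden activations
    if ablate is not None:
        h = h.copy()
        h[:, ablate] = 0.0  # deactivate the chosen neurons
    return h @ w2


# Random toy model and inputs (hypothetical stand-ins for an LLM layer).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w1 = rng.normal(size=(16, 32))
w2 = rng.normal(size=(32, 4))

ablate = np.array([1, 5, 9])  # indices assumed to come from a localization step

base = forward(x, w1, w2)
ablated = forward(x, w1, w2, ablate=ablate)

# Two simple measures of behavior shift: mean absolute change in the
# outputs, and the fraction of inputs whose top-scoring output changed.
shift = np.abs(base - ablated).mean()
flipped = (base.argmax(axis=1) != ablated.argmax(axis=1)).mean()
print(f"mean output shift: {shift:.3f}, top choice changed: {flipped:.2f}")
```

Zeroing activations is only one way to deactivate a neuron; replacing them with a mean value is a common alternative, and the source does not say which variant the authors use.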
