Internal Value Alignment in Large Language Models through Controlled Value Vector Activation

📅 2025-07-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of aligning the internal values of large language models (LLMs) with human values. It proposes Controlled Value Vector Activation (ConVA), a parameter-free, architecture-agnostic method that identifies value vectors under controlled context and applies gated activation to locate and dynamically modulate value-relevant directions in the model’s latent space. ConVA requires no fine-tuning or architectural modification, so the intervention on the model stays small. Evaluated on ten basic values, ConVA achieves the highest control success rate among the compared methods while preserving generation fluency and model performance, and it maintains the target value even under opposing or potentially malicious prompts. Its core contribution is the formulation of value alignment as an interpretable, controllable vector operation in latent space, which supports fine-grained, scenario-adaptive value guidance in LLMs.

📝 Abstract

Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifying the relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To control values consistently without sacrificing model performance, we introduce a gated value vector activation method that achieves effective value control with the minimum degree of intervention. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.

Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with human values for clarity and adaptability

Interpreting and modifying latent value representations in LLMs

Ensuring consistent values without harming model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled Value Vector Activation method

Context-controlled value vector identification

Gated value vector activation control
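
The two ideas listed above can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal example under stated assumptions. The value direction here is a simple difference of mean activations between value-positive and value-negative examples, used as a stand-in for a learned value vector, and the gate is a linear score with a threshold. All function names, the `bias` term, and `target_score` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def value_direction(pos_acts, neg_acts):
    """Unit-norm difference-of-means direction.

    Assumption: pos_acts and neg_acts are hidden states of texts that
    share the same context and differ only in whether they express the
    value, so the direction is not dominated by context differences.
    """
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def gated_steer(h, v, bias, target_score):
    """Shift h along v only when the gate says it is needed.

    The gate is the linear score dot(h, v) + bias. If the score already
    reaches target_score, h is returned unchanged. Otherwise h is moved
    by the smallest step along the unit vector v that reaches the target.
    """
    score = dot(h, v) + bias
    if score >= target_score:
        return list(h), 0.0
    eps = target_score - score  # v is unit-norm, so the score rises by eps
    return [x + eps * y for x, y in zip(h, v)], eps


if __name__ == "__main__":
    # Toy 2-d activations: only the first coordinate carries the value.
    v = value_direction([[1.0, 0.0], [3.0, 0.0]], [[-1.0, 0.0], [-3.0, 0.0]])
    steered, eps = gated_steer([-0.5, 2.0], v, bias=0.0, target_score=1.0)
    print("direction:", v, "step:", eps, "steered:", steered)
```

In this sketch the step size depends on how far the current state is from the target, which is one simple way to keep the intervention small: states that already satisfy the gate are left untouched.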

🔎 Similar Papers

Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning