Causal Graph Guided Steering of LLM Values via Prompts and Sparse Autoencoders

📅 2024-12-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak fine-grained controllability, limited interpretability, and high computational overhead in value alignment for large language models (LLMs), this paper proposes a lightweight steering framework grounded in an implicit value causal graph. We first extract causal relationships among multidimensional values directly from internal LLM representations to construct an interpretable causal graph. Subsequently, we design a dual-path intervention mechanism: (i) structured prompt templates for high-level value guidance, and (ii) sparse autoencoder (SAE)-driven targeted intervention on value-specific features. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate that our method significantly improves cross-dimensional value controllability and consistency compared to conventional RLHF baselines, while incurring substantially lower computational cost and offering strong mechanistic interpretability through the causal graph and sparse feature attribution.
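
The SAE path of this dual-path mechanism can be made concrete. Below is a minimal PyTorch sketch of SAE feature steering in general, not the paper's implementation: a pretrained sparse autoencoder encodes a residual-stream activation into sparse features, one value-specific feature is amplified, and the edited features are decoded back while the SAE's reconstruction error is preserved. The function name, `feature_idx`, `alpha`, and the ReLU-linear SAE form are all assumptions for illustration.

```python
import torch

def steer_with_sae(hidden, encoder, decoder, feature_idx, alpha):
    """Amplify one SAE feature inside a residual-stream activation.

    hidden:      activation vector, shape (d_model,)
    encoder:     linear map d_model -> d_sae (ReLU applied on top)
    decoder:     linear map d_sae -> d_model
    feature_idx: index of the value-specific feature (hypothetical)
    alpha:       steering strength (hypothetical hyperparameter)
    """
    latents = torch.relu(encoder(hidden))   # sparse feature activations
    error = hidden - decoder(latents)       # SAE reconstruction error
    latents = latents.clone()
    latents[feature_idx] += alpha           # boost the target value feature
    # Re-adding the error term ensures only the targeted feature's
    # contribution to the activation changes.
    return decoder(latents) + error

# Toy usage with random weights, for shape-checking only.
d_model, d_sae = 2048, 16384
encoder = torch.nn.Linear(d_model, d_sae)
decoder = torch.nn.Linear(d_sae, d_model)
with torch.no_grad():
    h_steered = steer_with_sae(torch.randn(d_model), encoder, decoder,
                               feature_idx=123, alpha=4.0)
```

In practice the steered activation would be patched back into the forward pass at the SAE's layer; the chosen feature and the sign of `alpha` determine which value dimension is pushed up or down.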

📝 Abstract
As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), often focus on a limited set of values and can be resource-intensive. Furthermore, the correlations among values have been largely overlooked and remain underutilized. Our framework addresses these limitations by mining a causal graph that elucidates the implicit relationships among various values within LLMs. Leveraging this causal graph, we implement two lightweight value-steering mechanisms, prompt template steering and Sparse Autoencoder (SAE) feature steering, and analyze the effects of altering one value dimension on others. Extensive experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our steering methods.
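
To make the cross-dimension analysis concrete, the sketch below shows one hypothetical way to combine prompt template steering with graph mining: steer a single value dimension upward via a template, score all dimensions before and after with an external evaluator, and draw a directed edge whenever the average shift clears a threshold. This is an illustrative stand-in, not the paper's algorithm (which mines the graph from internal representations); the template wording, dimension names, and `score_fn` interface are assumptions.

```python
VALUE_DIMENSIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

# Hypothetical structured steering template (illustrative wording).
TEMPLATE = ("You are an assistant who places very high emphasis on the value "
            "of {dimension} when responding.\n\nUser: {query}\nAssistant:")

def steering_effects(score_fn, queries, src):
    """Average per-dimension score shift when `src` is steered upward.

    score_fn(prompt) -> dict mapping each dimension to a value score for
    the model's response (hypothetical evaluator, e.g. an LLM judge).
    """
    deltas = {d: 0.0 for d in VALUE_DIMENSIONS}
    for q in queries:
        base = score_fn(q)
        steered = score_fn(TEMPLATE.format(dimension=src, query=q))
        for d in VALUE_DIMENSIONS:
            deltas[d] += (steered[d] - base[d]) / len(queries)
    return deltas

def mine_value_graph(score_fn, queries, threshold=0.2):
    """Directed edge (src, dst, weight) when steering src shifts dst."""
    edges = []
    for src in VALUE_DIMENSIONS:
        deltas = steering_effects(score_fn, queries, src)
        edges += [(src, dst, round(deltas[dst], 3))
                  for dst in VALUE_DIMENSIONS
                  if dst != src and abs(deltas[dst]) >= threshold]
    return edges
```

The resulting edge list is exactly the structure that lets one anticipate side effects of an intervention: an edge (src, dst) warns that boosting src will also move dst.
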
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Human Values Alignment
Complex Idea Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Graphs
Sparse Autoencoders
Resource-Efficient Human Alignment
Authors

Yipeng Kang (BIGAI)
Junqi Wang (State Key Laboratory of General Artificial Intelligence, BIGAI)
Yexin Li (State Key Laboratory of General Artificial Intelligence, BIGAI)
Fangwei Zhong (Beijing Normal University)
Xue Feng (State Key Laboratory of General Artificial Intelligence, BIGAI)
Mengmeng Wang (State Key Laboratory of General Artificial Intelligence, BIGAI)
Wenming Tu (State Key Laboratory of General Artificial Intelligence, BIGAI)
Quansen Wang (State Key Laboratory of General Artificial Intelligence, BIGAI; Peking University)
Hengli Li (Institute for Artificial Intelligence, Peking University)
Zilong Zheng (State Key Laboratory of General Artificial Intelligence, BIGAI)