DEAL: Disentangling Transformer Head Activations for LLM Steering

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Precise identification of behaviorally relevant attention heads during LLM inference remains challenging; existing methods often rely on superficial cues or heuristic strategies, resulting in poor controllability. Method: We propose a parameter-free causal attribution framework that introduces a vector-quantized autoencoder (VQ-AE) to achieve interpretable disentanglement of attention head activation spaces. Behavioral relevance is defined at the head level via a binary discrimination criterion—alignment versus violation of target behavior—enabling importance-weighted attribution. Contribution/Results: Our method significantly improves factual consistency control across 7 LLMs and 5 behavior-guided tasks. The identified attention heads exhibit strong cross-domain zero-shot generalization. By eliminating the need for fine-tuning and providing human-interpretable, robust behavioral intervention, our approach establishes a novel paradigm for controllable LLM reasoning.


📝 Abstract
Inference-time steering aims to alter the response characteristics of large language models (LLMs) without modifying their underlying parameters. A critical step in this process is the identification of internal modules within LLMs that are associated with the target behavior. However, current approaches to module selection often depend on superficial cues or ad-hoc heuristics, which can result in suboptimal or unintended outcomes. In this work, we propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations, partitioning the latent space into behavior-relevant and behavior-irrelevant subspaces, each quantized with a shared learnable codebook. We assess the behavioral relevance of each head by quantifying the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses using a binary classification metric. This yields a behavioral relevance score that reflects each head's discriminative capacity with respect to the target behavior, guiding both selection and importance weighting. Experiments on seven LLMs from two model families and five behavioral steering datasets demonstrate that our method enables more accurate inference-time interventions, achieving superior performance on the truthfulness-steering task. Furthermore, the heads selected by our approach exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
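The abstract's scoring step — rank each head by how separable its encodings are for behavior-aligned versus behavior-violating responses, using a binary classification metric — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it stands in for the VQ-AE encodings with raw activation vectors and uses AUROC over a difference-of-means projection as the (assumed) separability metric.

```python
import numpy as np

def head_relevance_score(aligned_codes, violating_codes):
    """Score one attention head by how separable its encodings are for
    behavior-aligned vs. behavior-violating responses.

    Separability here is AUROC over a 1-D projection of the encodings
    (an illustrative stand-in for the paper's binary discrimination
    criterion, whose exact form is not given in this summary).
    """
    aligned = np.asarray(aligned_codes, dtype=float)
    violating = np.asarray(violating_codes, dtype=float)
    # Project each encoding onto the difference of class means.
    direction = aligned.mean(axis=0) - violating.mean(axis=0)
    pos = aligned @ direction
    neg = violating @ direction
    # AUROC via the rank-sum (Mann-Whitney U) formulation.
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = len(pos), len(neg)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example: well-separated clusters should score near 1.0,
# i.e. a highly behavior-relevant head.
rng = np.random.default_rng(0)
aligned = rng.normal(loc=2.0, size=(50, 8))
violating = rng.normal(loc=-2.0, size=(50, 8))
score = head_relevance_score(aligned, violating)
```

A score near 0.5 would mark a head whose activations carry no information about the target behavior; scores near 1.0 would guide both head selection and importance weighting, as described in the abstract.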
Problem

Research questions and friction points this paper is trying to address.

Identify behavior-relevant attention heads in transformers
Improve inference-time steering without modifying model parameters
Enhance truthfulness-steering performance in large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal-attribution framework for head identification
Vector-quantized autoencoder partitions latent space
Behavioral relevance scoring guides head selection
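The VQ-AE at the core of the second bullet relies on the standard nearest-neighbor quantization step, sketched below. This is a generic illustration of vector quantization, not the paper's method: the learnable shared codebook and its partition into behavior-relevant and behavior-irrelevant subspaces are omitted.

```python
import numpy as np

def vq_quantize(z, codebook):
    """Nearest-neighbor vector quantization: map each latent vector to
    its closest codebook entry (the discrete bottleneck of a VQ-AE)."""
    # Squared Euclidean distances: (batch, K) from (batch, dim) and (K, dim).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)          # index of the nearest code per vector
    return codebook[idx], idx       # quantized vectors and their code indices

# Toy example: latents lying near codes 3 and 7 snap to those codes.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 4))                     # K=16 codes of dim 4
z = codebook[[3, 7]] + 0.01 * rng.normal(size=(2, 4))   # small perturbations
quantized, idx = vq_quantize(z, codebook)
```

In a VQ-AE, the discrete code indices give a compact, human-inspectable summary of each head's activation — which is what makes the disentanglement interpretable.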
Li-Ming Zhan
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong S.A.R.
Bo Liu
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong S.A.R.
Zexin Lu
Sichuan University
Chengqiang Xie
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong S.A.R.
Jiannong Cao
IEEE Fellow; Chair Professor, Hong Kong Polytechnic University
Distributed computing · Mobile and pervasive computing · Wireless sensor networks · Cloud computing · Big Data
Xiao-Ming Wu
Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong S.A.R.