Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

📅 2026-05-07

🤖 AI Summary
This work addresses two limitations of existing fine-tuned steering vectors: they require careful per-vector selection of steering factors, and they intervene across the entire sequence, which can degrade generation quality regardless of the factor chosen. The authors propose jointly training steering factors and directions so that post-hoc factor selection is no longer required, and they use neural network scaling theory to show that moderately large initialization sizes and learning rates for the steering factors are essential for stable and efficient joint training. They also introduce Prompt-only Steering Vectors (PrOSV), which draw on representation fine-tuning and intervene only on a few prompt tokens. On AxBench, PrOSV trained with the joint scheme outperforms conventional full-sequence steering vectors and achieves a better trade-off between general model utility and adversarial robustness.
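To make the idea of joint training concrete, the following is a minimal toy sketch, not the authors' training procedure. It assumes a squared-error objective on a single hidden vector and plain gradient descent; the function name `joint_train`, the separate learning rates `lr_factor` and `lr_direction`, and all numeric values are hypothetical choices for illustration. The point is only that the scalar factor and the direction are updated together, so no factor has to be picked afterwards.

```python
def joint_train(hidden, target, direction, factor, lr_direction, lr_factor, steps):
    """Jointly fit a scalar factor and a direction so that
    hidden + factor * direction moves close to target (toy objective, an assumption)."""
    losses = []
    for _ in range(steps):
        # Residual of the steered representation against the toy target.
        residual = [h + factor * d - t for h, d, t in zip(hidden, direction, target)]
        losses.append(sum(r * r for r in residual))
        # Gradients of the squared error with respect to the factor and the direction.
        grad_factor = 2.0 * sum(r * d for r, d in zip(residual, direction))
        grad_direction = [2.0 * factor * r for r in residual]
        # Both parameters are updated in the same step, each with its own learning rate.
        factor -= lr_factor * grad_factor
        direction = [d - lr_direction * g for d, g in zip(direction, grad_direction)]
    return factor, direction, losses


# Hypothetical values chosen only so the toy problem is well behaved.
learned_factor, learned_direction, losses = joint_train(
    hidden=[0.0, 0.0, 0.0],
    target=[1.0, 2.0, 2.0],
    direction=[0.1, 0.1, 0.1],
    factor=1.0,
    lr_direction=0.05,
    lr_factor=0.05,
    steps=200,
)
```

In this sketch the factor has its own initialization and learning rate, which is where the paper's finding about moderately large values for both would enter; the specific values the authors use are not given here.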
📝 Abstract
Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
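The difference between a full-sequence SV and a prompt-only SV can be sketched as a choice of which token positions receive the additive intervention. The example below is an illustrative assumption rather than the paper's implementation: it treats hidden states as plain lists, uses a hypothetical helper `steer`, and picks the first three positions as the "prompt tokens", since the abstract does not say which prompt tokens PrOSV intervenes on.

```python
def steer(hidden_states, direction, factor, positions=None):
    """Add factor * direction to the hidden states at the given positions.

    hidden_states: list of per-token vectors (lists of floats).
    positions: token indices to intervene on; None means every position,
    which corresponds to a full-sequence steering vector.
    """
    if positions is None:
        positions = range(len(hidden_states))
    selected = set(positions)
    return [
        [h + factor * d for h, d in zip(row, direction)] if i in selected else list(row)
        for i, row in enumerate(hidden_states)
    ]


# Toy sequence: 3 prompt tokens followed by 2 generated tokens (assumed split).
hidden_states = [[0.0, 0.0] for _ in range(5)]
direction = [1.0, 0.0]
factor = 2.0

full_sequence = steer(hidden_states, direction, factor)
prompt_only = steer(hidden_states, direction, factor, positions=[0, 1, 2])
```

Here `full_sequence` shifts all five positions, while `prompt_only` leaves the two generated positions untouched, which is the sense in which a prompt-only intervention avoids acting directly on the generation process.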
Problem

Research questions and friction points this paper is trying to address.

steering vectors
large language models
generation quality
prompt-only interventions
full-sequence steering
Innovation

Methods, ideas, or system contributions that make the work stand out.

steering vectors
joint training
Prompt-only SV
representation fine-tuning
adversarial robustness