🤖 AI Summary
Existing activation steering methods apply a fixed, global intervention at inference time, which offers little fine-grained control and often degrades model utility under strong steering. This work proposes Steer2Edit, a theoretically grounded, training-free framework that connects activation steering to weight editing. Steer2Edit treats steering vectors as diagnostic signals and uses them to derive component-level rank-1 weight edits to individual attention heads and MLP neurons, giving localized, interpretable behavioral control while preserving the standard forward pass and compatibility with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit achieves more favorable attribute-utility trade-offs than the baselines: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average.
📝 Abstract
Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
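To make the idea of a component-level rank-1 edit concrete, here is a minimal sketch in plain Python. It is not the procedure from the paper: the cosine-alignment score, the `top_k` selection, the `alpha` scale, and all function names are assumptions chosen for illustration. It only shows the general mechanism the abstract describes, namely using a steering direction to decide which MLP neurons to edit and then folding the change into the weights, so that the edited neuron's contribution scales with its own activation instead of being a fixed offset added to every hidden state.

```python
# Illustrative sketch only. The scoring rule, selection rule, and scaling
# below are assumptions, not the method from the Steer2Edit paper.
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def score_neurons(w_out_cols, steer):
    """Hypothetical diagnostic: alignment of each neuron's output vector
    (one column of W_out) with the steering direction."""
    return [cosine(col, steer) for col in w_out_cols]


def edit_neurons(w_out_cols, steer, scores, alpha, top_k):
    """Add alpha * score_i * steer to the output vectors of the top_k most
    aligned neurons. Each per-neuron update changes a single column of
    W_out, i.e. it is the rank-1 update (alpha * score_i * steer) e_i^T."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)[:top_k]
    edited = [list(col) for col in w_out_cols]  # leave the input untouched
    for i in ranked:
        edited[i] = [w + alpha * scores[i] * s for w, s in zip(edited[i], steer)]
    return edited, ranked


def mlp_out(w_out_cols, acts):
    """Residual-stream contribution of the MLP: sum_i acts[i] * w_out_cols[i]."""
    d = len(w_out_cols[0])
    return [sum(a * col[j] for a, col in zip(acts, w_out_cols)) for j in range(d)]


# Toy model: 3 neurons writing into a 3-dimensional residual stream.
steer = [1.0, 0.0, 0.0]
w_out_cols = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]

scores = score_neurons(w_out_cols, steer)
edited, ranked = edit_neurons(w_out_cols, steer, scores, alpha=0.5, top_k=1)

active = [2.0, 1.0, 0.0]  # edited neuron 0 fires
silent = [0.0, 1.0, 0.0]  # edited neuron 0 is off

print(mlp_out(w_out_cols, active), mlp_out(edited, active))
print(mlp_out(w_out_cols, silent), mlp_out(edited, silent))
```

In this toy setup the edit changes the output only on inputs where the selected neuron is active, and the forward pass itself is unchanged because the modification lives entirely in the weights. An inference-time steering hook would instead add the same vector to the hidden state on every input.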