🤖 AI Summary
Large language models (LLMs) often exhibit fragile safety alignment—vulnerable to jailbreaking and unreliable at rejecting malicious queries. Method: We propose a lightweight, fine-tuning-free safety alignment enhancement method grounded in white-box interpretability. It identifies a rank-one safety direction within the residual stream—derived solely from the model’s native activations—and enforces robust refusal of harmful requests via a targeted rank-one weight modification. Contribution/Results: Our approach is the first to leverage an intrinsic safety subspace for reverse-guided realignment, enabling repair of jailbroken or misaligned models without retraining. With Llama Guard 3 as the safety judge, it significantly improves harmful-query rejection rates while preserving original performance on general benchmarks—including MMLU, HellaSwag, and ARC—demonstrating effectiveness, computational efficiency, and deployment readiness.
📝 Abstract
Safety alignment in Large Language Models (LLMs) is often mediated by internal representations that trigger refusal of harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
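The core mechanics described above - a refusal direction estimated from harmful/harmless activation differences, then injected as a rank-one update to residual-stream write matrices - can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimension, scale `alpha`, random stand-in activations, and the specific update form `W' = (I + alpha * r r^T) W` are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical residual-stream width (toy scale)

# Stand-ins for mean residual-stream activations over harmful vs.
# harmless instruction pairs (random here; in practice, measured).
mean_harmful = rng.normal(size=d)
mean_harmless = rng.normal(size=d)

# Safety (refusal) direction: normalized difference of means.
r = mean_harmful - mean_harmless
r_hat = r / np.linalg.norm(r)

def rank_one_safety_injection(W, r_hat, alpha=0.5):
    """Amplify the component of W's output along r_hat via a
    rank-one update: W' = (I + alpha * r_hat r_hat^T) W.
    (One plausible form of the injection; an assumption here.)"""
    return W + alpha * np.outer(r_hat, r_hat) @ W

W = rng.normal(size=(d, d))  # stand-in residual-stream write matrix
W_new = rank_one_safety_injection(W, r_hat, alpha=0.5)

# Effect: for any input x, the output's projection onto r_hat grows
# by a factor (1 + alpha), while the orthogonal component is unchanged.
x = rng.normal(size=d)
proj_old = r_hat @ (W @ x)
proj_new = r_hat @ (W_new @ x)
print(np.isclose(proj_new, 1.5 * proj_old))  # → True
```

Because the update is rank-one and computed in closed form, applying it to every write matrix costs far less than any fine-tuning pass, which is what makes the "last-mile" framing plausible.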