🤖 AI Summary
Large language models (LLMs) often exhibit fragile safety alignment—vulnerable to jailbreaking and unreliable at rejecting malicious queries. Method: We propose a lightweight, fine-tuning-free safety alignment enhancement method grounded in white-box interpretability. It identifies a rank-one safety direction within the residual stream—derived solely from the model’s native activations—and enforces robust refusal of harmful requests via a targeted rank-one weight modification. Contribution/Results: Our approach is the first to leverage an intrinsic safety subspace for reverse-guided realignment, enabling repair of jailbroken or misaligned models without retraining. With Llama Guard 3 as the safety judge, it significantly improves harmful-query rejection rates while preserving original performance on general benchmarks—including MMLU, HellaSwag, and ARC—demonstrating effectiveness, computational efficiency, and deployment readiness.
📝 Abstract
Safety alignment in Large Language Models (LLMs) is often mediated by internal representations that trigger refusal of harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
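The core mechanics described above - a refusal direction estimated from harmful/harmless activation differences, then injected as a rank-one update to residual-stream write matrices - can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimension, scale `alpha`, random stand-in activations, and the specific update form `W' = (I + alpha * r r^T) W` are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical residual-stream width (toy scale)

# Stand-ins for mean residual-stream activations over harmful vs.
# harmless instruction pairs (random here; in practice, measured).
mean_harmful = rng.normal(size=d)
mean_harmless = rng.normal(size=d)

# Safety (refusal) direction: normalized difference of means.
r = mean_harmful - mean_harmless
r_hat = r / np.linalg.norm(r)

def rank_one_safety_injection(W, r_hat, alpha=0.5):
    """Amplify the component of W's output along r_hat via a
    rank-one update: W' = (I + alpha * r_hat r_hat^T) W.
    (One plausible form of the injection; an assumption here.)"""
    return W + alpha * np.outer(r_hat, r_hat) @ W

W = rng.normal(size=(d, d))  # stand-in residual-stream write matrix
W_new = rank_one_safety_injection(W, r_hat, alpha=0.5)

# Effect: for any input x, the output's projection onto r_hat grows
# by a factor (1 + alpha), while the orthogonal component is unchanged.
x = rng.normal(size=d)
proj_old = r_hat @ (W @ x)
proj_new = r_hat @ (W_new @ x)
print(np.isclose(proj_new, 1.5 * proj_old))  # → True
```

Because the update is rank-one and computed in closed form, applying it to every write matrix costs far less than any fine-tuning pass, which is what makes the "last-mile" framing plausible.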