AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

📅 2025-06-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between securely refusing jailbreak attacks and erroneously refusing benign prompts in large language models (LLMs), this paper proposes an activation-steering safety alignment method with dual objectives. It formulates activation steering as a theoretically grounded constrained optimization problem that jointly optimizes safety and utility: a null-space projection constraint preserves benign input representations with negligible degradation, while supervised linear regression learns refusal directions for steering malicious inputs. Evaluated across diverse jailbreak attack types, the method achieves significantly higher safety rates without sacrificing performance on general benchmarks such as MMLU and BBH, thereby retaining the model's original capabilities, and it demonstrates better robustness and generalization than existing activation-steering and vector-calibration safety alignment methods.
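The null-space projection idea in the summary can be sketched numerically: build a projector that annihilates the principal subspace of benign activations, so any steering map passed through it leaves benign representations essentially untouched. The following is a minimal NumPy sketch, not the authors' code; the function name, shapes, and rank heuristic are my own assumptions.

```python
import numpy as np

def null_space_projector(H_benign, tol=1e-6):
    """Projector P such that H_benign @ P ~ 0: steering vectors passed
    through P barely perturb benign representations (illustrative helper).

    H_benign: (n, d) matrix of benign prompt activations.
    """
    # Principal subspace of benign activations via SVD.
    _, S, Vt = np.linalg.svd(H_benign, full_matrices=False)
    k = int((S > tol * S[0]).sum())      # effective rank of the benign data
    V = Vt[:k].T                         # (d, k) benign subspace basis
    # P = I - V V^T removes the benign-subspace component.
    return np.eye(H_benign.shape[1]) - V @ V.T

# Toy check: low-rank "benign" activations are annihilated by P.
rng = np.random.default_rng(0)
H = rng.normal(size=(64, 16)) @ rng.normal(size=(16, 32))  # rank-16 data in 32-d
P = null_space_projector(H)
print(np.abs(H @ P).max())  # numerically ~0
```

Because `P` is an orthogonal projector (symmetric and idempotent), composing any learned steering map with it guarantees near-zero steering on the benign subspace by construction rather than by regularization.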

📝 Abstract
As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to the internal activations of LLMs during inference, which induces refusal behavior. However, indiscriminately applying activation steering suffers from a fundamental trade-off between safety and utility, since the same steering vector can also cause over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it treats activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero steering vector for benign data under a null-space constraint. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data via linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising their general capabilities. Our code is available at https://github.com/AlphaLab-USTC/AlphaSteer.
Problem

Research questions and friction points this paper is trying to address.

Ensuring LLMs reliably refuse malicious prompts, especially under jailbreak attacks.
Mitigating the safety–utility trade-off of indiscriminate activation steering.
Grounding refusal steering in theory rather than heuristic calibration or conditioning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable activation steering with a principled null-space constraint
Utility preservation by constructing a nearly zero steering vector for benign inputs
Safety enhancement by learning refusal directions via linear regression