🤖 AI Summary
Large language models (LLMs) remain vulnerable to jailbreak attacks, and existing activation-steering defenses suffer from suboptimal safety-utility trade-offs because they rely on fixed steering coefficients. Method: We propose AdaSteer, a training-free adaptive activation-steering framework. We formally introduce the Rejection Law and the Harmfulness Law, enabling real-time, interpretable steering along two directions, the Rejection Direction (RD) and the Harmfulness Direction (HD), grounded in input semantics. Steering strength is adjusted dynamically via activation-space projection, with per-input coefficients learned via logistic regression. Results: Evaluated on LLaMA-3.1, Gemma-2, and Qwen2.5, AdaSteer improves defense rates against mainstream jailbreak attacks by 12–37%, reduces false-rejection rates on benign queries to under 0.8%, and incurs negligible inference overhead.
📝 Abstract
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
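To make the mechanism concrete, here is a minimal sketch of the adaptive-steering idea described above: project the input's hidden activation onto a steering direction, map that projection through a pre-fit logistic-regression layer to get a per-input coefficient, then steer along both directions. All names (`adaptive_coefficient`, `adasteer_like`), the toy dimensionality, and the logistic-regression weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Toy setup: random unit vectors stand in for the Rejection Direction (rd)
# and Harmfulness Direction (hd) extracted from a model's activation space.
rng = np.random.default_rng(0)
d = 16                                    # hidden size (toy, assumed)
rd = rng.normal(size=d); rd /= np.linalg.norm(rd)
hd = rng.normal(size=d); hd /= np.linalg.norm(hd)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_coefficient(h, direction, w, b):
    """Map the activation's scalar projection onto a steering direction
    through a (pre-fit) logistic-regression layer -> coefficient in (0, 1)."""
    proj = float(h @ direction)
    return sigmoid(w * proj + b)

def adasteer_like(h, w_rd=-4.0, b_rd=0.0, w_hd=4.0, b_hd=0.0, scale=2.0):
    """Steer h along both directions with input-dependent strengths.
    Per the Rejection Law, an input projecting *against* rd (negative
    projection) receives a larger rd coefficient; the hd term pushes
    representations away from the harmfulness direction."""
    a_rd = adaptive_coefficient(h, rd, w_rd, b_rd)
    a_hd = adaptive_coefficient(h, hd, w_hd, b_hd)
    return h + scale * (a_rd * rd - a_hd * hd)

h = rng.normal(size=d)          # stand-in for one token's hidden state
h_steered = adasteer_like(h)
```

Because the coefficients depend on each input's own projections, benign inputs (which already align with the rejection direction) receive weak steering, while adversarial inputs receive strong steering, which is the safety-utility trade-off the fixed-coefficient baselines miss.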