Steering Language Model Refusal with Sparse Autoencoders

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 12
Influential: 1
🤖 AI Summary
This work investigates the trade-off between safety refusal capability and general language competence in large language models. To strengthen refusal robustness *without modifying model weights*, the authors propose a test-time feature intervention based on sparse autoencoders (SAEs): SAE features that mediate refusal behavior are selectively amplified at inference time. Experiments show that this steering strategy substantially improves refusal success rates against both single-turn and multi-turn jailbreak attacks; however, it also systematically degrades performance across multiple general-purpose benchmarks, even on safe inputs. This provides empirical evidence that safety refusal features and foundational language capabilities are deeply entangled in activation space, challenging the assumption that safety and capability can be cleanly decoupled and offering theoretical insight and empirical grounding for safety-intervention paradigms in trustworthy AI.

📝 Abstract
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we explore an alternative: steering model activations at inference time via amplifying sparse autoencoder (SAE) features that mediate refusal. This work uncovers a fundamental tension between SAE steering-based safety improvements and general model capabilities. While feature steering successfully improves robustness against both single-turn and challenging multi-turn jailbreak attempts, we discover that this comes at a previously underexplored cost -- systematic degradation of performance across multiple benchmark tasks, even on safe inputs with no apparent connection to refusal behavior. This suggests that features mediating refusal may be more deeply entangled with general language model capabilities than previously understood. Our findings reveal important open questions about the nature of safety-relevant features in language models and the feasibility of isolating them for targeted intervention. While SAE-based steering shows promise as a flexible approach to enhancing language model safety, our results highlight the critical need to understand and address the mechanisms behind these capability tradeoffs before such techniques can be practically deployed.
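The intervention the abstract describes, amplifying refusal-mediating SAE features in a model's activations at inference time, can be sketched as follows. This is a minimal illustration with toy, randomly initialized SAE parameters; the matrix names (`W_enc`, `W_dec`, etc.), sizes, and the way the refusal feature is chosen are assumptions for demonstration, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes; real SAEs are far wider than the residual stream

# Hypothetical SAE parameters (random here, purely for illustration).
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_features(x):
    """Encode a residual-stream activation into sparse (ReLU) SAE features."""
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)

def steer(x, feature_idx, alpha):
    """Test-time steering: add alpha times the chosen feature's decoder
    direction to the activation. Model weights are untouched."""
    return x + alpha * W_dec[:, feature_idx]

x = rng.normal(size=d_model)                       # stand-in for a model activation
refusal_feature = int(np.argmax(sae_features(x)))  # placeholder feature selection
x_steered = steer(x, refusal_feature, alpha=4.0)
```

In practice the refusal feature would be identified by correlating SAE feature activations with refusal behavior across prompts, and the steered activation would be written back into the forward pass at the chosen layer.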
Problem

Research questions and friction points this paper is trying to address.

Steering language model refusal without altering weights
Balancing safety improvements with model capability tradeoffs
Understanding entanglement of safety features and general performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Steering model activations via sparse autoencoders
Amplifying refusal-mediating SAE features
Uncovering systematic capability degradation as a cost of refusal steering
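The safety side of the tradeoff above is typically quantified as a refusal success rate over adversarial prompts, compared before and after steering. A minimal, hypothetical sketch using a crude string-matching heuristic; the marker list and scoring are illustrative stand-ins, not the paper's evaluation protocol.

```python
def refusal_rate(responses, markers=("i can't", "i cannot", "i'm sorry", "i won't")):
    """Fraction of responses containing a refusal marker.

    A crude keyword heuristic, used here only to illustrate the metric;
    real evaluations usually rely on a classifier or human judgment.
    """
    refused = sum(any(m in r.lower() for m in markers) for r in responses)
    return refused / len(responses)

# Toy before/after comparison on hypothetical jailbreak-attempt responses.
baseline = refusal_rate(["Sure, here is how...", "I can't help with that."])
steered = refusal_rate(["I can't help with that.", "I cannot assist with this request."])
```

The paper's central finding is that the same comparison run on general-purpose benchmarks moves in the opposite direction: steering that raises the refusal rate also lowers benchmark scores.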
👥 Authors

Kyle O'Brien (Microsoft)
D. Majercak (Microsoft)
Xavier Fernandes (Microsoft)
Richard Edgar (Microsoft)
Jingya Chen (Microsoft)
Harsha Nori (Microsoft AI)
Dean Carignan (Microsoft)
Eric Horvitz (Microsoft)
Forough Poursabzi-Sangdeh (Microsoft)