SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of jointly ensuring safety, text quality, and topic relevance during inference in large language models (LLMs), this paper proposes a latent-space safety steering method that is gradient-free and requires no contrastive safety data. Leveraging mechanistic interpretability analysis, the approach identifies risk-category-specific neurons, constructs unsupervised category vectors, and dynamically activates or suppresses the corresponding latent directions during inference, enabling fine-grained, interpretable risk modulation. Unlike conventional refusal-based alignment, the method substantially mitigates blanket refusals while preserving output safety, coherence, and topical consistency. Extensive evaluation across multiple models (Llama-2/3, Qwen), benchmarks (SafeBench, TrustLLM), and risk categories (hate speech, privacy violation, illegality) demonstrates its effectiveness: safety improves by up to 27.4%, with negligible impact on generation quality (BLEU +0.8, ROUGE-L +1.2).
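The core mechanism the summary describes, adding or suppressing a category-specific direction in a layer's hidden activations at inference time, can be sketched as follows. This is a minimal illustration only: a toy linear layer stands in for a transformer layer, and the steering vector, scale factor `alpha`, and hook placement are hypothetical choices, not the paper's actual implementation (which derives category vectors from unsupervised analysis of model activations).

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer layer; hidden_dim is illustrative.
hidden_dim = 16
layer = nn.Linear(hidden_dim, hidden_dim)

# Hypothetical unit-norm category direction and steering strength.
# In the paper's setting this vector would be derived per risk category
# from the model's own activations, without contrastive safe/unsafe pairs.
steer_vec = torch.randn(hidden_dim)
steer_vec = steer_vec / steer_vec.norm()
alpha = 4.0

def steering_hook(module, inputs, output):
    # Shift the layer's activations along the category direction.
    # A negative alpha would suppress the direction instead.
    return output + alpha * steer_vec

handle = layer.register_forward_hook(steering_hook)
x = torch.zeros(1, hidden_dim)
steered = layer(x)          # forward pass with steering applied
handle.remove()
baseline = layer(x)         # forward pass without steering

# The steered output differs from the baseline by exactly alpha * steer_vec.
shift = steered - baseline
```

Because the intervention is a fixed vector addition via a forward hook, it needs no gradients or fine-tuning, which is what makes this style of steering cheap to apply and to retarget as safety policies change.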

📝 Abstract
Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free unsupervised method that enhances safety steering while preserving text quality and topic relevance, without resorting to explicit refusals, and (iii) accomplishing this without a hard requirement of contrastive pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.
Problem

Research questions and friction points this paper is trying to address.

Adapting LLMs to safety policies without costly fine-tuning
Enhancing safety steering without sacrificing text quality
Achieving precise safety control without contrastive safe data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Category-specific steering vectors for precise control
Gradient-free unsupervised method enhances safety steering
No hard requirement of contrastive pairwise safe data