🤖 AI Summary
To address the challenge of jointly ensuring safety, text quality, and topic relevance during inference in large language models (LLMs), this paper proposes a gradient-free latent-space safety steering method that requires no contrastive safety data. Leveraging mechanistic interpretability analysis, the approach identifies risk-category-specific neurons, constructs unsupervised category vectors, and dynamically activates or suppresses the corresponding latent directions at inference time, enabling fine-grained, interpretable risk modulation. Unlike conventional refusal-based alignment, the method substantially mitigates blanket refusal while preserving output safety, coherence, and topical consistency. Extensive evaluation across multiple models (Llama-2/3, Qwen), benchmarks (SafeBench, TrustLLM), and risk categories (hate speech, privacy violation, illegality) demonstrates its effectiveness: safety improves by up to 27.4%, with negligible impact on generation quality (BLEU +0.8, ROUGE-L +1.2).
📝 Abstract
Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free, unsupervised method that enhances safety steering while preserving text quality and topic relevance and avoiding explicit refusals, and (iii) accomplishing this without a hard requirement for contrastive pairwise safe data. We also note that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.
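To make the steering idea concrete, here is a minimal, hypothetical sketch (not the paper's code) of unsupervised category steering: a category direction is built as the mean of hidden activations collected on prompts from one risk category alone (no contrastive safe/unsafe pairs), and at inference a hidden state is shifted along or against that direction. Real implementations would operate on transformer-layer activations captured via forward hooks; the vectors, `alpha` scale, and function names below are illustrative assumptions.

```python
# Illustrative sketch, not the SafeSteer implementation.
# Hidden states are plain Python lists standing in for
# transformer-layer activations captured via forward hooks.

def mean_vector(activations):
    """Average activation vectors from one risk category into a
    single unsupervised category steering vector."""
    dim = len(activations[0])
    return [sum(a[i] for a in activations) / len(activations) for i in range(dim)]

def steer(hidden, category_vec, alpha=-1.0):
    """Shift a hidden state along the (unit-normalized) category direction.
    Negative alpha suppresses the risk category; positive alpha amplifies it."""
    norm = sum(v * v for v in category_vec) ** 0.5 or 1.0
    unit = [v / norm for v in category_vec]
    return [h + alpha * u for h, u in zip(hidden, unit)]

# Toy usage: build a category direction from category-only activations
# (no paired safe/unsafe data needed), then steer a state away from it.
category_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]
v = mean_vector(category_acts)
h = [1.0, 1.0, 1.0]
h_steered = steer(h, v, alpha=-1.0)
```

The gradient-free aspect is visible here: the direction comes from simple activation statistics, with no backpropagation or fine-tuning involved.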