🤖 AI Summary
This work addresses the fundamental challenge of balancing safety and utility in large language models (LLMs) by proposing an interpretable feature-modulation approach to safety alignment. First, sparse autoencoders (SAEs) are applied to extract semantically distinct, safety-relevant features from the hidden layers of Llama-3 8B. Second, a contrastive prompting mechanism, guided by principled feature selection, enables fine-grained, controllable intervention on refusal behavior. Third, the method is evaluated on prompts generated with OpenHermes-2.5-Mistral-7B and on the Air Bench EU dataset. Experiments show that the approach improves safety performance by 18.9% while simultaneously increasing utility by 11.1%, without degrading the safety of individual responses. Crucially, this is the first method to systematically overcome the safety–utility trade-off without relying on reinforcement learning or model fine-tuning, establishing a new paradigm for efficient, transparent, and scalable safety alignment.
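The core intervention described above, modulating a single SAE feature in a hidden layer by a chosen steering strength, can be sketched as follows. This is a minimal illustration with toy dimensions and random weights, not the paper's implementation: the variable names (`W_enc`, `W_dec`, `steer`) and the choice of feature index and strength are hypothetical, and a real setup would use a trained SAE over Llama-3 8B activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes; a real SAE is far wider than the model dim

# Toy SAE weights; in practice these come from a trained sparse autoencoder.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def sae_encode(h):
    """ReLU feature activations of the SAE for a hidden state h."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def steer(h, feature_idx, strength):
    """Feature-level steering: add `strength` units of one SAE feature's
    decoder direction to the hidden state, leaving other features intact."""
    return h + strength * W_dec[feature_idx]

h = rng.normal(size=d_model)                       # a residual-stream activation
h_steered = steer(h, feature_idx=7, strength=4.0)  # amplify a refusal-linked feature
```

In this formulation, positive strengths push the model toward the behavior the selected feature encodes (e.g. refusal) and negative strengths suppress it, which is what makes the safety–utility trade-off tunable per feature.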
📝 Abstract
Deploying Large Language Models (LLMs) requires guiding the model to recognize and refuse unsafe prompts while still complying with safe ones. Previous methods achieve this by adjusting model weights or through other expensive procedures. While recent advances in sparse autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We explore steering with different SAE features and steering strengths as a solution. Using a contrastive prompting method over the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench EU dataset to efficiently select the best features to steer, we test this method on Llama-3 8B. Our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome the traditional safety-utility tradeoff when optimal features are identified through principled selection.