🤖 AI Summary
This work addresses the fundamental challenge of balancing safety and utility in large language models (LLMs) by proposing an interpretable feature-modulation approach to safety alignment. First, sparse autoencoders (SAEs) are applied to extract semantically distinct, safety-relevant features from the hidden layers of Llama-3 8B. Second, a contrastive prompting mechanism, guided by principled feature selection, enables fine-grained, controllable intervention on refusal behavior. Third, the method is evaluated on prompts generated with OpenHermes-2.5-Mistral-7B and on the Air Bench EU dataset. Experiments show that the approach improves safety performance by 18.9% while simultaneously increasing utility by 11.1%, without degrading the safety of individual responses. Crucially, this is the first method to systematically overcome the safety–utility trade-off without relying on reinforcement learning or model fine-tuning, establishing a new paradigm for efficient, transparent, and scalable safety alignment.
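The core intervention described above, modulating a single SAE feature in a hidden layer by a chosen steering strength, can be sketched as follows. This is a minimal illustration with toy dimensions and random weights, not the paper's implementation: the variable names (`W_enc`, `W_dec`, `steer`) and the choice of feature index and strength are hypothetical, and a real setup would use a trained SAE over Llama-3 8B activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes; a real SAE is far wider than the model dim

# Toy SAE weights; in practice these come from a trained sparse autoencoder.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def sae_encode(h):
    """ReLU feature activations of the SAE for a hidden state h."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def steer(h, feature_idx, strength):
    """Feature-level steering: add `strength` units of one SAE feature's
    decoder direction to the hidden state, leaving other features intact."""
    return h + strength * W_dec[feature_idx]

h = rng.normal(size=d_model)                       # a residual-stream activation
h_steered = steer(h, feature_idx=7, strength=4.0)  # amplify a refusal-linked feature
```

In this formulation, positive strengths push the model toward the behavior the selected feature encodes (e.g. refusal) and negative strengths suppress it, which is what makes the safety–utility trade-off tunable per feature.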
📝 Abstract
Deploying Large Language Models (LLMs) requires guiding the model to recognize and refuse unsafe prompts while still complying with safe ones. Previous methods achieve this by adjusting model weights or through other expensive procedures. While recent advances in sparse autoencoders (SAEs) have enabled interpretable feature extraction from LLMs, existing approaches lack systematic feature selection methods and principled evaluation of safety-utility tradeoffs. We explore steering with different SAE features and steering strengths as a solution. Using a contrastive prompting method over the AI-Generated Prompts Dataset from teknium/OpenHermes-2p5-Mistral-7B and the Air Bench EU dataset to efficiently select the best features to steer, we test this method on Llama-3 8B. Our approach achieves an 18.9% improvement in safety performance while simultaneously increasing utility by 11.1%, demonstrating that targeted SAE steering can overcome the traditional safety-utility tradeoff when optimal features are identified through principled selection.