🤖 AI Summary
This work addresses the loss of refusal capability, and the resulting drop in safety, in compressed large language models (LLMs). We use mechanistic interpretability, specifically residual stream activation tracing and directional projection analysis, to systematically assess the safety of compressed models and to identify architecture-agnostic refusal directions in representation space. Based on this insight, we propose a lightweight, training-free directional intervention that directly modulates activations along these refusal-aligned directions. Our approach restores refusal capability without any accuracy loss on standard benchmarks, increasing safe response rates by 32.7% while incurring less than 0.5% additional inference overhead. Key contributions include: (1) an interpretable, mechanistic explanation for refusal degradation under compression; (2) identification of a refusal direction that holds across model architectures; and (3) an efficient, training-free, performance-preserving approach to improving the safety of compressed LLMs.
📝 Abstract
The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. Extensive research has examined model compression through the lens of safety, and its findings suggest that safety-aligned models often lose elements of trustworthiness after compression. In parallel, the field of mechanistic interpretability has gained traction, with notable discoveries such as the identification of a single direction in the residual stream that mediates refusal behavior across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting an interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.
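As a rough illustration of what an intervention along a single residual-stream direction can look like, the sketch below estimates a refusal direction as the normalized difference between mean activations on harmful and harmless prompts, then shifts an activation so that its projection onto that direction reaches a chosen target. The difference-of-means estimator, the function names, and the toy 2-dimensional vectors are assumptions made for illustration; the abstract does not specify the authors' exact procedure.

```python
import math
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def refusal_direction(
    harmful: Sequence[Sequence[float]],
    harmless: Sequence[Sequence[float]],
) -> List[float]:
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means estimate, used here only as an example.
    """
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]


def steer(
    activation: Sequence[float],
    direction: Sequence[float],
    target: float,
) -> List[float]:
    """Move the activation along `direction` until its projection equals `target`.

    The component orthogonal to `direction` is left unchanged
    (`direction` is assumed to have unit norm).
    """
    shift = target - dot(activation, direction)
    return [a + shift * d for a, d in zip(activation, direction)]


# Toy data: hypothetical 2-dimensional "activations".
harmful_acts = [[3.0, 1.0], [5.0, 1.0]]
harmless_acts = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(harmful_acts, harmless_acts)

# An activation whose refusal component has been weakened.
weakened = [0.5, 2.0]
# Restore the projection to the mean projection of the harmful set.
target = dot(mean_vector(harmful_acts), direction)
restored = steer(weakened, direction, target)
```

In a real setting the vectors would be residual-stream activations read from a chosen layer of the model and handled with an array library. The toy version only shows the arithmetic: one projection and one scaled vector addition per activation, which is why an intervention of this kind adds little inference cost.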