Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

📅 2025-04-05

📈 Citations: 0

✨ Influential: 0

career value

223K/year

🤖 AI Summary

This work addresses the degradation of refusal capability and reduced safety in compressed large language models (LLMs). We introduce mechanistic interpretability—specifically residual stream activation tracing and directional projection analysis—to systematically assess safety in compressed models, identifying architecture-agnostic refusal directions in the representation space. Based on this insight, we propose a lightweight, training-free directional intervention method that directly modulates activations along these refusal-aligned directions. Our approach restores refusal capability without any accuracy loss on standard benchmarks, increasing safe response rates by 32.7%, while incurring less than 0.5% additional inference overhead. Key contributions include: (1) uncovering an interpretable mechanistic explanation for refusal degradation under compression; (2) discovering a universal refusal direction across model architectures; and (3) establishing a novel, efficient, zero-shot, performance-preserving safety enhancement paradigm for compressed LLMs.

Technology Category

Application Category

📝 Abstract

The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.

Problem

Research questions and friction points this paper is trying to address.

Understanding refusal mechanisms in compressed language models

Improving safety of compressed models via interpretability

Enhancing model safety without performance loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mechanistic interpretability analyzes refusal mechanisms

Lightweight method enhances compressed model safety

Computationally efficient without performance loss

🔎 Similar Papers

Efficient and Accurate Explanation Estimation with Distribution Compression