🤖 AI Summary
This study investigates the internal mechanisms by which large language models (LLMs) refuse harmful instructions. We propose a sparse autoencoder (SAE)-based latent feature localization method that, for the first time, identifies causal neural features driving refusal behavior within the activation space. Through causal intervention, we validate the interpretability of these features by demonstrating their direct influence on output generation, and we uncover their mechanistic links to upstream safety alignment signals and downstream jailbreak failures. Evaluated on two open-source dialogue models across multiple harmful instruction datasets, our method successfully isolates intervenable refusal features. These features significantly enhance the out-of-distribution robustness of linear probes on jailbroken inputs, improving average detection accuracy by +12.7%. Our work establishes an interpretable, intervention-aware, mechanism-level analytical framework for LLM safety alignment.
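The probe result above can be illustrated with a toy sketch. The code below is not the authors' implementation: the data, dimensions, and class shift are synthetic stand-ins for SAE feature activations, where prompts that trigger refusal activate the relevant features more strongly than benign ones. A simple logistic-regression probe then separates the two classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16  # hypothetical: n prompts, d refusal-related SAE features

# Synthetic stand-in for SAE feature activations: refusal-triggering
# prompts (label 1) activate every feature more strongly than benign
# prompts (label 0).
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 2.0  # shift the refusal class along all features

# Train a linear probe (logistic regression) by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / n           # gradient of the log loss
    b -= lr * (p - y).mean()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = ((p > 0.5) == (y == 1)).mean()        # training accuracy of the probe
```

In practice one would use a held-out split and a library probe (e.g. scikit-learn's `LogisticRegression`); the point here is only that a linear readout over feature activations suffices when the features carry the refusal signal.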
📝 Abstract
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful instruction datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as the upstream-downstream relationship between latent features and the mechanisms underlying adversarial jailbreaking techniques. We also show that refusal features improve the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
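The intervention described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the SAE weights are random, the dimensions and the feature index are hypothetical, and a real setup would load a trained SAE and hook it into a model's residual stream. The pattern shown is standard SAE steering: encode an activation into sparse latents, clamp one feature, and add the reconstruction difference back so the SAE's reconstruction error is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical: residual width, SAE dictionary size
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_encode(x):
    # ReLU yields a sparse, non-negative latent code.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z):
    return z @ W_dec

def intervene(x, feature_idx, scale=0.0):
    """Rescale one latent feature and apply the edit to the activation.
    scale=0.0 ablates the feature; scale>1.0 amplifies (steers) it."""
    z = sae_encode(x)
    z_edit = z.copy()
    z_edit[feature_idx] = scale * z[feature_idx]
    # Add only the *difference* of reconstructions, so the part of x the
    # SAE fails to reconstruct passes through unchanged.
    return x + sae_decode(z_edit) - sae_decode(z)

x = rng.normal(size=d_model)                     # stand-in activation
x_ablated = intervene(x, feature_idx=3, scale=0.0)
```

Ablating a putative refusal feature this way and checking whether the model stops refusing (or amplifying it and checking whether refusal appears on benign prompts) is the causal test the abstract refers to.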