🤖 AI Summary
This study investigates the internal mechanisms by which large language models (LLMs) refuse harmful instructions. We propose a sparse autoencoder (SAE)-based latent feature localization method that, for the first time, identifies causal neural features driving refusal behavior within the activation space. Through causal intervention, we validate the interpretability of these features by demonstrating their direct influence on output generation, and we uncover their mechanistic links to upstream safety alignment signals and downstream jailbreak failures. Evaluated on two open-source dialogue models across multiple harmful instruction datasets, our method successfully isolates intervenable refusal features. These features significantly enhance the out-of-distribution robustness of linear probes on jailbroken inputs, improving average detection accuracy by +12.7%. Our work establishes an interpretable, intervention-aware, mechanism-level analytical framework for LLM safety alignment.
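The probe result above can be illustrated with a toy sketch. The code below is not the authors' implementation: the data, dimensions, and class shift are synthetic stand-ins for SAE feature activations, where prompts that trigger refusal activate the relevant features more strongly than benign ones. A simple logistic-regression probe then separates the two classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16  # hypothetical: n prompts, d refusal-related SAE features

# Synthetic stand-in for SAE feature activations: refusal-triggering
# prompts (label 1) activate every feature more strongly than benign
# prompts (label 0).
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 2.0  # shift the refusal class along all features

# Train a linear probe (logistic regression) by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / n           # gradient of the log loss
    b -= lr * (p - y).mean()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = ((p > 0.5) == (y == 1)).mean()        # training accuracy of the probe
```

In practice one would use a held-out split and a library probe (e.g. scikit-learn's `LogisticRegression`); the point here is only that a linear readout over feature activations suffices when the features carry the refusal signal.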
📝 Abstract
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful instruction datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as the upstream-downstream relationship between latent features and the mechanisms underlying adversarial jailbreaking techniques. We also show that refusal features improve the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
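The intervention described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the SAE weights are random, the dimensions and the feature index are hypothetical, and a real setup would load a trained SAE and hook it into a model's residual stream. The pattern shown is standard SAE steering: encode an activation into sparse latents, clamp one feature, and add the reconstruction difference back so the SAE's reconstruction error is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical: residual width, SAE dictionary size
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)

def sae_encode(x):
    # ReLU yields a sparse, non-negative latent code.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z):
    return z @ W_dec

def intervene(x, feature_idx, scale=0.0):
    """Rescale one latent feature and apply the edit to the activation.
    scale=0.0 ablates the feature; scale>1.0 amplifies (steers) it."""
    z = sae_encode(x)
    z_edit = z.copy()
    z_edit[feature_idx] = scale * z[feature_idx]
    # Add only the *difference* of reconstructions, so the part of x the
    # SAE fails to reconstruct passes through unchanged.
    return x + sae_decode(z_edit) - sae_decode(z)

x = rng.normal(size=d_model)                     # stand-in activation
x_ablated = intervene(x, feature_idx=3, scale=0.0)
```

Ablating a putative refusal feature this way and checking whether the model stops refusing (or amplifying it and checking whether refusal appears on benign prompts) is the causal test the abstract refers to.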