Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

2026-05-27
Citations: 0
Influential: 0
AI Summary
This work investigates whether refusal can be predicted from the internal activations of large language models before a refusal response is generated. The authors train linear probes on residual stream activations at each transformer block and show that refusal behavior becomes linearly decodable well before the final layer. Building on this finding, they propose Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. The approach achieves attack success rates competitive with the original AutoDAN while reducing per-iteration search time by up to 72%. Probe-guided prompts also match or exceed AutoDAN's cross-model transfer in several configurations, and the usefulness of probe guidance increases with model scale.
Abstract
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.
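The layer-wise probing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the activations below are synthetic stand-ins for real residual streams, with a planted "refusal direction" whose strength is assumed to grow with depth, and every function name (`make_synthetic_activations`, `train_probe`, `layerwise_accuracy`) is hypothetical. It only shows what "linearly decodable at layer k" means in practice: a logistic-regression probe trained on one layer's activations reaches high held-out accuracy.

```python
import numpy as np


def make_synthetic_activations(n_layers=6, n_samples=400, dim=32, seed=0):
    """Synthetic stand-in for residual-stream activations (an assumption).

    Returns an array of shape (n_layers, n_samples, dim) and binary labels,
    where 1 means "refusal" and 0 means "compliance". The label signal is
    planted along one direction and grows linearly with layer index.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    acts = []
    for layer in range(n_layers):
        strength = 3.0 * layer / (n_layers - 1)  # zero signal at layer 0
        noise = rng.normal(size=(n_samples, dim))
        signal = np.outer(2 * labels - 1, direction) * strength
        acts.append(noise + signal)
    return np.stack(acts), labels


def train_probe(x, y, lr=0.1, steps=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(refusal)
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose thresholded probe output matches y."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


def layerwise_accuracy(acts, labels, train_frac=0.75):
    """Train one probe per layer and report held-out accuracy for each."""
    split = int(train_frac * len(labels))
    accs = []
    for layer_acts in acts:
        w, b = train_probe(layer_acts[:split], labels[:split])
        accs.append(probe_accuracy(layer_acts[split:], labels[split:], w, b))
    return accs


if __name__ == "__main__":
    acts, labels = make_synthetic_activations()
    for i, acc in enumerate(layerwise_accuracy(acts, labels)):
        print(f"layer {i}: held-out probe accuracy = {acc:.2f}")
```

With real models, the synthetic array would be replaced by activations captured at each transformer block, and the layer at which held-out accuracy first becomes high would indicate how early a partial forward pass could stop while still giving a usable refusal score.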
Problem

Research questions and friction points this paper is trying to address.

refusal detection
intermediate activations
large language models
safety behavior
linear probes
Innovation

Methods, ideas, or system contributions that make the work stand out.

refusal detection
intermediate activations
linear probes
Mechanistic AutoDAN
prompt optimization