🤖 AI Summary
Existing jailbreak detection methods based on terminal representations are vulnerable to adversarial attacks. This work proposes SALO, a runtime detector that uses causal tracing across the generation process to uncover dynamic, sparse refusal trajectories within large language models. Moving beyond the conventional static refusal vector paradigm, SALO exploits upstream hidden states to enable robust detection even when terminal signals are suppressed. Under strong adversarial conditions, where prior approaches detect close to 0% of attacks, SALO raises jailbreak detection rates to over 90%, marking the first effective defense against forced-decoding-style jailbreak attacks.
📝 Abstract
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% in settings where methods relying on terminal states perform poorly.
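The contrast between a terminal-only detector and one that reads upstream states can be illustrated with a toy sketch. This is not SALO and not the authors' code: it assumes a refusal vector built as a difference of means (one common construction, not confirmed by the abstract), uses synthetic Gaussian "hidden states" in place of real model activations, and simulates an attack by simply removing the signal at the last layer. All names (`refusal_direction`, `detect`, `signal`, the layer count) are hypothetical.

```python
import numpy as np

def refusal_direction(harmful, harmless):
    """Unit-norm difference of means between two sets of hidden states."""
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)

# Synthetic stand-in for activations: (n_prompts, n_layers, d_model).
# ASSUMPTION: harmful prompts carry an extra offset at every layer.
rng = np.random.default_rng(0)
n, n_layers, d_model = 200, 4, 16
signal = np.zeros(d_model)
signal[0] = 3.0
harmless = rng.normal(size=(n, n_layers, d_model))
harmful = rng.normal(size=(n, n_layers, d_model)) + signal

# One direction and one midpoint threshold per layer.
directions = [refusal_direction(harmful[:, l], harmless[:, l])
              for l in range(n_layers)]
thresholds = [0.5 * ((harmful[:, l] @ directions[l]).mean()
                     + (harmless[:, l] @ directions[l]).mean())
              for l in range(n_layers)]

def detect(states, layer):
    """Flag prompts whose projection at `layer` exceeds that layer's threshold."""
    return (states[:, layer] @ directions[layer]) > thresholds[layer]

# Simulated attack (toy assumption): the signal is erased at the last layer only.
attacked = harmful.copy()
attacked[:, -1] -= signal

terminal_rate = detect(attacked, -1).mean()  # detector reading the last layer
upstream_rate = detect(attacked, 0).mean()   # detector reading an earlier layer
print(f"terminal-layer detection rate: {terminal_rate:.2f}")
print(f"upstream-layer detection rate: {upstream_rate:.2f}")
```

In this toy setting the terminal-layer detector misses most attacked prompts while the upstream detector still flags them, which mirrors the qualitative claim in the abstract. It says nothing about the magnitudes reported there or about how SALO localizes sparse activations.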