Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing jailbreak detection methods that rely on terminal representations are vulnerable to adversarial attacks. This work proposes SALO, a runtime detector that applies causal tracing across the generation process to uncover dynamic, sparse refusal trajectories within large language models. Moving beyond the conventional static refusal vector paradigm, SALO exploits upstream hidden states, so detection remains robust even when terminal signals are suppressed. Under forced-decoding attacks, where prior approaches detect close to 0% of jailbreaks, SALO raises the detection rate to over 90%, making it the first effective defense against forced-decoding-style jailbreak attacks.

📝 Abstract

Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
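
To make the contrast in the abstract concrete, the following toy sketch compares a terminal-only refusal score with a score read from upstream layers. It is not SALO and not the authors' code: the difference-of-means direction is a commonly used stand-in for a static refusal vector, and the hidden states, layer count, threshold, and every function name here are assumptions made up for illustration.

```python
import random

DIM, N_LAYERS, N_PROMPTS = 4, 3, 20   # toy sizes, assumptions for illustration
THRESHOLD = 1.0                        # hypothetical decision threshold
rng = random.Random(0)

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refusal_direction(harmful, harmless):
    """Unit-norm difference of means between harmful and harmless states."""
    diff = [h - b for h, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def projection(state, direction):
    """Scalar projection of a hidden state onto a direction."""
    return sum(s * d for s, d in zip(state, direction))

def synthetic_state(offset):
    """Toy hidden state: `offset` along axis 0 plus small Gaussian noise."""
    return [(offset if i == 0 else 0.0) + rng.gauss(0.0, 0.1) for i in range(DIM)]

# One direction per layer, fitted on synthetic harmful vs. harmless states.
directions = []
for _ in range(N_LAYERS):
    harmful = [synthetic_state(2.0) for _ in range(N_PROMPTS)]
    harmless = [synthetic_state(0.0) for _ in range(N_PROMPTS)]
    directions.append(refusal_direction(harmful, harmless))

# Toy "attacked" prompt: upstream layers keep the signal, the last layer loses it.
attacked = [[2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

per_layer = [projection(s, d) for s, d in zip(attacked, directions)]
terminal_score = per_layer[-1]                          # terminal-only detector
upstream_score = sum(per_layer[:-1]) / (N_LAYERS - 1)   # mean over upstream layers

print("terminal flags prompt:", terminal_score > THRESHOLD)
print("upstream flags prompt:", upstream_score > THRESHOLD)
```

In this constructed example the terminal-only score stays below the threshold while the upstream score exceeds it, which mirrors the failure mode the abstract attributes to terminal-state methods. A real detector would read hidden states from an actual model rather than synthetic vectors.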

Problem

Research questions and friction points this paper is trying to address.

refusal trajectory

jailbreak detection

adversarial attacks

representation engineering

forced-decoding attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Refusal Trajectory

Causal Tracing

SALO