Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

📅 2026-05-02

🤖 AI Summary
Existing jailbreak detectors that rely on terminal representations are vulnerable to adversarial attacks. This work proposes SALO, a runtime detector that uses causal tracing across the generation process to uncover dynamic, sparse refusal trajectories inside large language models. Moving beyond the conventional static refusal-vector paradigm, SALO reads upstream hidden states, so detection stays robust even when terminal signals are suppressed. Under strong adversarial conditions, where prior approaches drop to near 0% detection, SALO raises jailbreak detection rates to over 90%, marking the first effective defense against forced-decoding-style jailbreak attacks.
📝 Abstract
Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal Tracing, we uncover the Refusal Trajectory, a persistent upstream signature that remains intact even when adversarial attacks (e.g., GCG) suppress terminal signals. Leveraging this, we propose SALO (Sparse Activation Localization Operator), an inference-time detector designed to capture these latent patterns. SALO effectively recovers defense capabilities against forced-decoding attacks, improving detection rates from ~0% to >90% where methods relying on terminal states perform poorly.
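The contrast the abstract draws, between a detector that reads only the terminal representation and one that also reads upstream hidden states, can be illustrated with a toy sketch. This is not SALO: the abstract does not specify its operator, so the example swaps in a plain difference-of-means refusal direction per layer and a simple vote over upstream layers. All function names, thresholds, and the hand-built numbers are hypothetical and chosen only to make the idea concrete.

```python
import numpy as np

def refusal_direction(harmful, harmless):
    """Unit difference-of-means direction between two (n, d) sets of hidden states."""
    direction = harmful.mean(axis=0) - harmless.mean(axis=0)
    return direction / np.linalg.norm(direction)

def layer_scores(hidden, directions):
    """Project per-layer hidden states (L, d) onto per-layer directions (L, d)."""
    return np.einsum("ld,ld->l", hidden, directions)

def detect_upstream(hidden, directions, thresholds, upstream_layers, min_hits):
    """Flag an input if at least min_hits upstream layers exceed their thresholds.

    Assumption: a simple vote over layers, standing in for whatever
    aggregation the real detector uses.
    """
    scores = layer_scores(hidden, directions)
    hits = sum(int(scores[l] > thresholds[l]) for l in upstream_layers)
    return hits >= min_hits

# Toy data (invented for illustration): 3 layers, 2-dimensional hidden states.
# The last layer plays the role of a terminal state whose refusal signal
# has been suppressed, while the two upstream layers still carry it.
directions = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
hidden = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 1.0]])
thresholds = np.array([1.0, 1.0, 1.0])

scores = layer_scores(hidden, directions)
terminal_flag = bool(scores[-1] > thresholds[-1])
upstream_flag = detect_upstream(hidden, directions, thresholds,
                                upstream_layers=[0, 1], min_hits=2)
print("terminal-only detector fires:", terminal_flag)
print("upstream detector fires:", upstream_flag)
```

In this toy setup the terminal-only check misses the input because the last layer's projection is zero, while the upstream vote still fires. In practice the hidden states would come from a model's intermediate layers, and the directions and thresholds would be fit on labeled prompts.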
Problem

Research questions and friction points this paper is trying to address.

refusal trajectory
jailbreak detection
adversarial attacks
representation engineering
forced-decoding attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refusal Trajectory
Causal Tracing
SALO
Jailbreak Detection
Dynamic Refusal