🤖 AI Summary
Explicit chain-of-thought (CoT) reasoning in large reasoning models (LRMs) was intended to enhance safety guardrails, yet this work reveals it can be maliciously exploited for jailbreaking. Method: We propose CoT Hijacking, a novel attack paradigm in which long sequences of benign reasoning steps are prepended to harmful queries, diluting mid-layer safety signals and disrupting final-layer safety verification. Using mechanistic interpretability, we identify critical attention heads within the safety subnetwork and validate our hypothesis via targeted attention interference and ablation studies. Contribution/Results: We provide the first empirical evidence that CoT's inherent interpretability paradoxically constitutes a security vulnerability. Our attack achieves up to 100% success across major LRMs (including Gemini, GPT, Grok, and Claude), substantially outperforming prior methods. All code and data are publicly released.
📝 Abstract
Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior work suggests this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models that pads harmful requests with long sequences of harmless puzzle reasoning. On HarmBench, CoT Hijacking reaches attack success rates (ASRs) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4-mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand the attack's effectiveness, we turn to a mechanistic analysis, which shows that mid-layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
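The dilution mechanism described above can be illustrated with a toy model. This is a minimal sketch, not the paper's actual analysis: it assumes a single attention head whose pre-softmax scores are roughly i.i.d. across positions, so the attention mass landing on the final request tokens shrinks as the benign prefix grows (roughly `request_len / (prefix_len + request_len)`).

```python
import numpy as np

def attention_on_request(prefix_len, request_len=10, seed=0):
    """Toy single-head attention over (benign prefix + request) tokens.

    Scores are drawn i.i.d. from a normal distribution (an assumption for
    illustration, not measured from a real model), then softmax-normalized.
    Returns the fraction of attention mass on the last `request_len` tokens.
    """
    rng = np.random.default_rng(seed)
    scores = rng.normal(size=prefix_len + request_len)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights[-request_len:].sum()

if __name__ == "__main__":
    # Attention on the request tokens decays as the benign prefix lengthens.
    for n in (0, 100, 1000, 10000):
        print(f"prefix_len={n:5d}  attention_on_request={attention_on_request(n):.4f}")
```

With no prefix all attention mass sits on the request; with a 10,000-token benign prefix the request receives on the order of 0.1% of it, matching the intuition that long harmless reasoning drowns out the safety-relevant tokens.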