🤖 AI Summary
This work identifies a novel security vulnerability in Large Reasoning Models (LRMs): adversaries can hijack their Chain-of-Thought (CoT) inference control flow via malicious embeddings, inducing infinite reasoning loops and consequent resource exhaustion. To exploit this, we propose the "Deadlock Attack", the first adversarial optimization method that implants a backdoor activated by a single-token trigger, bridging the projection gap between continuous embedding space and discrete token sequences. By precisely steering transitional tokens (e.g., "Wait", "But"), the attack persistently blocks conclusion generation. Evaluated across four state-of-the-art LRMs and three mathematical reasoning benchmarks, the attack achieves a 100% success rate, consistently forcing models to hit their maximum token limit, while remaining stealthy on benign inputs and robust against existing overthinking mitigations. This is the first systematic demonstration of controllability risks inherent in CoT mechanisms.
📝 Abstract
Modern large reasoning models (LRMs) exhibit impressive multi-step problem-solving via chain-of-thought (CoT) reasoning. However, this iterative thinking mechanism introduces a new vulnerability surface. We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow by optimizing a malicious adversarial embedding that induces perpetual reasoning loops. Specifically, the optimized embedding encourages transitional tokens (e.g., "Wait", "But") after reasoning steps, preventing the model from concluding its answer. A key challenge we identify is the continuous-to-discrete projection gap: naïvely projecting the adversarial embedding onto a token sequence nullifies the attack. To overcome this, we introduce a backdoor implantation strategy that enables reliable activation through a specific trigger token. Our method achieves a 100% attack success rate across four advanced LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) and three math reasoning benchmarks, forcing models to generate up to their maximum token limits. The attack is also stealthy, causing negligible utility loss on benign user inputs, and remains robust against existing strategies for mitigating the overthinking issue. Our findings expose a critical and underexplored security vulnerability in LRMs from the perspective of reasoning (in)efficiency.
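To make the core idea concrete, below is a minimal PyTorch sketch (not the paper's implementation) of the optimization step the abstract describes: a single trainable embedding appended to the prompt is updated so that the model's next-token distribution is pushed toward a transitional token such as "Wait". The model name, prompt, learning rate, and step count are illustrative assumptions.

```python
# Conceptual sketch of optimizing one adversarial embedding in continuous space.
# Assumptions: a Hugging Face causal LRM, a placeholder prompt, and toy hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed LRM; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Solve: 17 * 24 = ?"  # placeholder reasoning query
ids = tok(prompt, return_tensors="pt").input_ids
prompt_emb = model.get_input_embeddings()(ids)  # (1, T, d), kept frozen

# One trainable adversarial embedding appended after the prompt (continuous space).
adv = torch.nn.Parameter(torch.randn(1, 1, prompt_emb.size(-1)) * 0.01)
wait_id = tok("Wait", add_special_tokens=False).input_ids[0]  # transitional token id

opt = torch.optim.Adam([adv], lr=1e-2)
for step in range(200):
    inputs_embeds = torch.cat([prompt_emb.detach(), adv], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits[:, -1, :]
    # Raise the probability of emitting the transitional token next,
    # which keeps the model "thinking" instead of concluding.
    loss = torch.nn.functional.cross_entropy(logits, torch.tensor([wait_id]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

As the abstract notes, simply projecting such an optimized embedding back to the nearest discrete token nullifies its effect; the paper's backdoor implantation strategy exists precisely to bridge this gap, so that a discrete single-token trigger can activate the same looping behavior at inference time.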