🤖 AI Summary
This work identifies a novel security vulnerability in Large Reasoning Models (LRMs): adversaries can hijack their Chain-of-Thought (CoT) inference control flow via malicious embeddings, inducing infinite reasoning loops and consequent resource exhaustion. To exploit this, we propose the "Deadlock Attack", the first adversarial optimization method that implants a backdoor activated by a single-token trigger, bridging the projection gap between continuous embedding space and discrete token sequences. By precisely steering transitional tokens (e.g., "Wait", "But"), the attack persistently blocks conclusion generation. Evaluated across four state-of-the-art LRMs and three mathematical reasoning benchmarks, the attack achieves a 100% success rate, consistently forcing models to hit their maximum token limit, while remaining stealthy on benign inputs and robust against existing overthinking mitigations. This is the first systematic demonstration of controllability risks inherent in CoT mechanisms.
📝 Abstract
Modern large reasoning models (LRMs) exhibit impressive multi-step problem-solving via chain-of-thought (CoT) reasoning. However, this iterative thinking mechanism introduces a new vulnerability surface. We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow by optimizing a malicious adversarial embedding that induces perpetual reasoning loops. Specifically, the optimized embedding encourages transitional tokens (e.g., "Wait", "But") after reasoning steps, preventing the model from concluding its answer. A key challenge we identify is the continuous-to-discrete projection gap: naïvely projecting the adversarial embedding onto a token sequence nullifies the attack. To overcome this, we introduce a backdoor implantation strategy that enables reliable activation through a specific trigger token. Our method achieves a 100% attack success rate across four advanced LRMs (Phi-RM, Nemotron-Nano, R1-Qwen, R1-Llama) and three math reasoning benchmarks, forcing models to generate up to their maximum token limits. The attack is also stealthy, causing negligible utility loss on benign user inputs, and remains robust against existing strategies for mitigating the overthinking issue. Our findings expose a critical and underexplored security vulnerability in LRMs from the perspective of reasoning (in)efficiency.
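To make the core idea concrete, below is a minimal PyTorch sketch (not the paper's implementation) of the optimization step the abstract describes: a single trainable embedding appended to the prompt is updated so that the model's next-token distribution is pushed toward a transitional token such as "Wait". The model name, prompt, learning rate, and step count are illustrative assumptions.

```python
# Conceptual sketch of optimizing one adversarial embedding in continuous space.
# Assumptions: a Hugging Face causal LRM, a placeholder prompt, and toy hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed LRM; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Solve: 17 * 24 = ?"  # placeholder reasoning query
ids = tok(prompt, return_tensors="pt").input_ids
prompt_emb = model.get_input_embeddings()(ids)  # (1, T, d), kept frozen

# One trainable adversarial embedding appended after the prompt (continuous space).
adv = torch.nn.Parameter(torch.randn(1, 1, prompt_emb.size(-1)) * 0.01)
wait_id = tok("Wait", add_special_tokens=False).input_ids[0]  # transitional token id

opt = torch.optim.Adam([adv], lr=1e-2)
for step in range(200):
    inputs_embeds = torch.cat([prompt_emb.detach(), adv], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits[:, -1, :]
    # Raise the probability of emitting the transitional token next,
    # which keeps the model "thinking" instead of concluding.
    loss = torch.nn.functional.cross_entropy(logits, torch.tensor([wait_id]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

As the abstract notes, simply projecting such an optimized embedding back to the nearest discrete token nullifies its effect; the paper's backdoor implantation strategy exists precisely to bridge this gap, so that a discrete single-token trigger can activate the same looping behavior at inference time.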