🤖 AI Summary
This work identifies and formalizes a novel backdoor attack—“overthinking backdoor”—targeting the chain-of-thought (CoT) mechanism in large reasoning models (LRMs), enabling fine-grained control over inference step redundancy rather than binary trigger activation. Methodologically, we propose a programmable data poisoning framework: a teacher model generates CoT samples with deliberate step redundancy, and repeated trigger tokens encode signal strength to controllably amplify the target model’s inference length. Our key contributions include the first formalization of tunable over-reasoning backdoors, which preserve output accuracy (stealth), scale across architectures, and induce targeted computational overhead. Experiments across diverse LRMs demonstrate stable 2–5× increases in inference steps while maintaining answer accuracy, establishing a new paradigm for security evaluation of reasoning models.
📝 Abstract
Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger-where the number of repetitions signals the desired intensity-with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer's correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.