BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

📅 2025-07-24

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

This work identifies and formalizes a novel backdoor attack—“overthinking backdoor”—targeting the chain-of-thought (CoT) mechanism in large reasoning models (LRMs), enabling fine-grained control over inference step redundancy rather than binary trigger activation. Methodologically, we propose a programmable data poisoning framework: a teacher model generates CoT samples with deliberate step redundancy, and repeated trigger tokens encode signal strength to controllably amplify the target model’s inference length. Our key contributions include the first formalization of tunable over-reasoning backdoors, which preserve output accuracy (stealth), scale across architectures, and induce targeted computational overhead. Experiments across diverse LRMs demonstrate stable 2–5× increases in inference steps while maintaining answer accuracy, establishing a new paradigm for security evaluation of reasoning models.

Technology Category

Application Category

📝 Abstract

Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger-where the number of repetitions signals the desired intensity-with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer's correctness. Our source code is available at https://github.com/FZaKK/BadReasoner.

Problem

Research questions and friction points this paper is trying to address.

Identifying overthinking backdoors in large reasoning models

Proposing tunable backdoors to control reasoning verbosity

Implementing stealthy data poisoning to increase resource consumption

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tunable backdoor controls reasoning verbosity precisely

Data poisoning pairs triggers with verbose responses

Programmatic redundant refinement steps ensure stealth

🔎 Similar Papers

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks