MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

📅 2026-04-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes a novel stealthy backdoor attack against large language models that circumvents existing process-monitoring defenses by operating in the model’s output posterior space. Unlike conventional chain-of-thought (CoT)-based attacks, which leave detectable traces during reasoning, the proposed method achieves a “thinking correctly but answering incorrectly” paradigm through a tailored training strategy and minimal data poisoning—requiring only a 5% poisoning rate. Evaluated across four datasets and five mainstream models, the attack attains over 90% success rates while maintaining high stealth and robustness under stringent detection scenarios, including trigger perturbation tests and CoT-based anomaly detection. This approach significantly enhances adversarial resilience against state-of-the-art defensive mechanisms.
📝 Abstract
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor(MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.
Problem

Research questions and friction points this paper is trying to address.

backdoor attack
Chain-of-Thought
stealthiness
reasoning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MirageBackdoor
Chain-of-Thought
backdoor attack
stealthiness
post-output manipulation
🔎 Similar Papers
No similar papers found.
Y
Yizhe Zeng
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
W
Wei Zhang
Beijing University of Posts and Telecommunications, Beijing, China
Yunpeng Li
Yunpeng Li
Institute of Information Engineering,Chinese Academy of Sciences
Large Language ModelsCyber Security
J
Juxin Xiao
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
X
Xiao Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Y
Yuling Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China