MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

This work proposes a novel stealthy backdoor attack against large language models that circumvents existing process-monitoring defenses by operating in the model’s output posterior space. Unlike conventional chain-of-thought (CoT)-based attacks, which leave detectable traces during reasoning, the proposed method achieves a “thinking correctly but answering incorrectly” paradigm through a tailored training strategy and minimal data poisoning—requiring only a 5% poisoning rate. Evaluated across four datasets and five mainstream models, the attack attains over 90% success rates while maintaining high stealth and robustness under stringent detection scenarios, including trigger perturbation tests and CoT-based anomaly detection. This approach significantly enhances adversarial resilience against state-of-the-art defensive mechanisms.

Technology Category

Application Category

📝 Abstract

While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor(MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.

Problem

Research questions and friction points this paper is trying to address.

backdoor attack

Chain-of-Thought

stealthiness

reasoning

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

MirageBackdoor

Chain-of-Thought

backdoor attack