Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the security risk that lightweight adapters can hijack the Chain-of-Thought (CoT) reasoning of open-weight large language models. Persistent CoT hijacking faces three challenges: CoT tokens are hard to hijack directly within one long CoT-output sequence without destabilizing downstream outputs, malicious CoT training data is scarce, and naive backdoor injection is unstable. The authors propose Multiple Reverse Tree Search (MRTS), which synthesizes output-aligned CoTs from prompt-output pairs, and Two-stage Backdoor Hijacking (TSBH), which first induces a trigger-conditioned mismatch between the intermediate CoT and malicious outputs, then fine-tunes the model on MRTS-generated CoTs that have lower embedding distance to those outputs. Experiments across multiple open-weight models show trigger-activated CoT hijacking with a quantifiable distinction between hijacked and baseline states. The study also explores a reasoning-based mitigation and releases a safety-reasoning dataset to support future research.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in settings where Chain-of-Thought (CoT) is interpreted by users. This creates a new safety risk: attackers may manipulate the model's observable CoT to induce malicious behaviors. In open-weight ecosystems, such manipulation can be embedded in lightweight adapters that are easy to distribute and attach to base models. In practice, persistent CoT hijacking faces three main challenges: the difficulty of directly hijacking CoT tokens within one continuous long CoT-output sequence while maintaining stable downstream outputs, the scarcity of malicious CoT data, and the instability of naive backdoor injection methods. To address the data scarcity issue, we propose Multiple Reverse Tree Search (MRTS), a reverse synthesis procedure that constructs output-aligned CoTs from prompt-output pairs without directly eliciting malicious CoTs from aligned models. Building on MRTS, we introduce Two-stage Backdoor Hijacking (TSBH), which first induces a trigger-conditioned mismatch between the intermediate CoT and malicious outputs, and then fine-tunes the model on MRTS-generated CoTs that have lower embedding distance to the malicious outputs, thereby ensuring stronger semantic similarity. Experiments across multiple open-weight models demonstrate that our method successfully induces trigger-activated CoT hijacking while maintaining a quantifiable distinction between hijacked and baseline states under our evaluation framework. We further explore a reasoning-based mitigation approach and release a safety-reasoning dataset to support future research on safety-aware and reliable reasoning. Our code is available at https://github.com/ChangWenhan/TSBH_official.
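
The abstract describes keeping CoT candidates that have lower embedding distance to a reference output. As a rough illustration of that one idea only, the sketch below ranks candidate texts by distance to a reference vector. Everything in it is an assumption made for illustration: the abstract does not name the embedding model or the distance metric, so cosine distance is used here, the toy vectors stand in for real embeddings, and the names cosine_distance and rank_by_distance are hypothetical and not taken from the authors' code.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(candidates, reference, keep):
    """Return the names of the `keep` candidates closest to `reference`.

    `candidates` maps a candidate name to its embedding vector.
    Cosine distance is an assumed metric, chosen only for this sketch.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_distance(item[1], reference),
    )
    return [name for name, _ in ranked[:keep]]


# Toy 2-D vectors standing in for real text embeddings.
toy_candidates = {
    "candidate_a": [1.0, 0.0],
    "candidate_b": [0.0, 1.0],
    "candidate_c": [1.0, 1.0],
}
toy_reference = [1.0, 0.1]
closest = rank_by_distance(toy_candidates, toy_reference, keep=2)
```

With these toy vectors, candidate_a points in nearly the same direction as the reference and candidate_b is nearly orthogonal to it, so the two retained names are candidate_a followed by candidate_c.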
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought Hijacking
Backdoor Attack
Large Language Models
Safety Risk
Adversarial Manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Hijacking
Backdoor Attack
Multiple Reverse Tree Search
Two-stage Backdoor Hijacking
Safety-aware Reasoning