Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an indirect, targeted poisoning method against chain-of-thought (CoT) large language models that circumvents the limitations of existing backdoor attacks, which rely on explicitly injecting trigger-laden, mislabeled examples and require access to target-domain data. Instead, the approach embeds malicious reasoning logic into the CoT trajectories of a source task by subtly altering only the intermediate reasoning steps—leaving inputs and final answers unchanged—thereby achieving clean-label poisoning. This study demonstrates for the first time the feasibility of cross-task CoT poisoning, enabling adversarial control over model behavior on unseen target tasks without any target-domain samples, leveraging transfer learning to propagate corrupted reasoning. Experiments show the attack achieves up to 70% success on completely held-out tasks while simultaneously improving model performance by 10–15% on multiple reasoning benchmarks, enhancing stealth and evading current defenses.

📝 Abstract
Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models' capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models using CoT datasets from public repositories like HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior works have shown the possibility of mounting backdoor attacks in CoT-based models, these attacks require explicit inclusion of triggered queries with flawed reasoning and incorrect answers in the training set to succeed. Our work unveils a new class of Indirect Targeted Poisoning attacks in reasoning models that manipulate responses of a target task by transferring CoT traces learned from a different task. Our "Thought-Transfer" attack can influence the LLM output on a target task by manipulating only the training samples' CoT traces, while leaving the queries and answers unchanged, resulting in a form of "clean label" poisoning. Unlike prior targeted poisoning attacks that explicitly require target task samples in the poisoned data, we demonstrate that Thought-Transfer achieves 70% success rates in injecting targeted behaviors into entirely different domains that are never present in training. Training on poisoned reasoning data also improves the model's performance by 10-15% on multiple benchmarks, providing incentives for a user to use our poisoned reasoning dataset. Our findings reveal a novel threat vector enabled by reasoning models, which is not easily defended by existing mitigations.
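The clean-label poisoning setup described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `inject_malicious_logic` is a hypothetical placeholder for the adversary's trace-rewriting step, and the sample schema (`query`/`cot`/`answer`) is an assumption.

```python
def inject_malicious_logic(cot_trace: str) -> str:
    """Hypothetical stand-in for the adversary's rewrite of a reasoning
    trace: it embeds adversarial reasoning logic without changing the
    trace's final conclusion."""
    return cot_trace + " [adversarially inserted reasoning step]"


def poison_dataset(dataset, poison_rate=0.1):
    """Clean-label CoT poisoning: rewrite only the intermediate reasoning
    traces of a fraction of samples, leaving every query and final answer
    unchanged."""
    n_poison = int(len(dataset) * poison_rate)
    poisoned = []
    for i, sample in enumerate(dataset):
        cot = sample["cot"]
        if i < n_poison:
            # Only the reasoning trace is manipulated.
            cot = inject_malicious_logic(cot)
        # Query and answer are copied verbatim -> "clean label".
        poisoned.append(
            {"query": sample["query"], "cot": cot, "answer": sample["answer"]}
        )
    return poisoned
```

The key property, per the abstract, is that a label-checking defense sees nothing wrong: inputs and answers are benign, and the malicious behavior is carried entirely by the reasoning traces and transfers to unseen target tasks during fine-tuning.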
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
poisoning attacks
reasoning models
indirect targeted attack
clean label
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thought-Transfer
Chain-of-Thought reasoning
Indirect Targeted Poisoning
clean-label poisoning
cross-domain backdoor