Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

📅 2026-04-12

🤖 AI Summary
Large language models are vulnerable to reasoning-level backdoor attacks, wherein adversaries embed malicious reasoning steps into the chain-of-thought via trigger mechanisms, leading models to produce seemingly plausible yet harmful outputs. This work proposes Critical-CoT, a novel defense framework specifically designed to counter such attacks. Critical-CoT employs a two-stage fine-tuning strategy to endow models with critical thinking capabilities, enabling them to automatically detect and reject compromised reasoning steps. Experimental results demonstrate that the proposed method achieves strong robustness against both in-context learning and fine-tuning-based backdoor attacks across multiple mainstream large language models and datasets. Furthermore, it significantly enhances model safety while exhibiting excellent generalization across domains and tasks.

📝 Abstract
Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT) process on LLMs to develop critical thinking behaviors, enabling them to automatically identify potential backdoors and refuse to generate malicious reasoning steps. Extensive experiments across multiple LLMs and datasets demonstrate that Critical-CoT provides strong robustness against both in-context learning-based and FT-based backdoor attacks. Notably, Critical-CoT exhibits strong cross-domain and cross-task generalization. Our code is available at https://github.com/tuanvu171/Critical-CoT.
Problem

Research questions and friction points this paper is trying to address.

backdoor attacks
reasoning-level
large language models
chain-of-thought
adversarial defense
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning-level backdoor
chain-of-thought
critical thinking
defense framework
large language models