Thought Purity: Defense Paradigm For Chain-of-Thought Attack

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models (LRMs) trained via reinforcement learning (RL) exhibit vulnerability to low-cost backdoor attacks targeting chain-of-thought (CoT) generation—termed CoT attacks (CoTA)—leading to simultaneous degradation in both safety and task performance. Method: This paper proposes “Thought Purity,” a novel defense paradigm featuring the first end-to-end CoT defense framework tailored for RL-aligned reasoning systems. It integrates three synergistic mechanisms: (i) secure data-flow processing, (ii) rule-augmented dynamic constraints driven by RL-based optimization, and (iii) behavior-modeling–enabled real-time monitoring. Contribution/Results: The framework achieves a dynamic trade-off between attack robustness and reasoning fidelity. Experiments demonstrate significant improvements in LRM security and task stability across diverse CoTA scenarios, with strong generalization across attack variants and practical deployability.

Technology Category

Application Category

📝 Abstract
While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Models (LLMs) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model's core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability through exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense paradigm that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline (2) reinforcement learning-enhanced rule constraints (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.
Problem

Research questions and friction points this paper is trying to address.

Defend against Chain-of-Thought attacks in reasoning models
Prevent adversarial subversion of core reasoning mechanisms
Balance security and performance in AI architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety-optimized data processing pipeline
Reinforcement learning-enhanced rule constraints
Adaptive monitoring metrics
🔎 Similar Papers
No similar papers found.
Z
Zihao Xue
Huzhou University
Zhen Bi
Zhen Bi
Zhejiang University, Huzhou University
Knowledge GraphLanguage ModelOn-device LLM
Long Ma
Long Ma
Dalian University of Technology
Computer VisionImage Processing
Z
Zhenlin Hu
Huzhou University
Y
Yan Wang
Alibaba Group
Z
Zhenfang Liu
Huzhou University, Zhejiang Key Laboratory of Intelligent Education Technology and Application
Q
Qing Sheng
Huzhou University, Zhejiang Key Laboratory of Intelligent Education Technology and Application
Jie Xiao
Jie Xiao
University of Science and Technology of China
low level visiongenerative modelmachine learning
J
Jungang Lou
Huzhou University, Zhejiang Key Laboratory of Intelligent Education Technology and Application