Thought Purity: Defense Paradigm For Chain-of-Thought Attack

📅 2025-07-16

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

Large reasoning models (LRMs) trained via reinforcement learning (RL) exhibit vulnerability to low-cost backdoor attacks targeting chain-of-thought (CoT) generation—termed CoT attacks (CoTA)—leading to simultaneous degradation in both safety and task performance. Method: This paper proposes “Thought Purity,” a novel defense paradigm featuring the first end-to-end CoT defense framework tailored for RL-aligned reasoning systems. It integrates three synergistic mechanisms: (i) secure data-flow processing, (ii) rule-augmented dynamic constraints driven by RL-based optimization, and (iii) behavior-modeling–enabled real-time monitoring. Contribution/Results: The framework achieves a dynamic trade-off between attack robustness and reasoning fidelity. Experiments demonstrate significant improvements in LRM security and task stability across diverse CoTA scenarios, with strong generalization across attack variants and practical deployability.

Technology Category

Application Category

📝 Abstract

While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Models (LLMs) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model's core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability through exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense paradigm that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline (2) reinforcement learning-enhanced rule constraints (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.

Problem

Research questions and friction points this paper is trying to address.

Defend against Chain-of-Thought attacks in reasoning models

Prevent adversarial subversion of core reasoning mechanisms

Balance security and performance in AI architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety-optimized data processing pipeline

Reinforcement learning-enhanced rule constraints

Adaptive monitoring metrics

🔎 Similar Papers

A Survey of Defenses against AI-generated Visual Media: Detection, Disruption, and Authentication