TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of small language models to reasoning-based backdoor attacks, which exploit seemingly plausible but logically flawed reasoning chains that are difficult for such models to detect. To mitigate this, the authors propose a process-guided security framework that treats reasoning trajectories as untrusted payloads and employs a three-stage defense mechanism. First, contrastive reasoning synthesis automatically generates adversarial training data. Second, step-aware supervised fine-tuning (SSFT) enhances the model’s sensitivity to individual reasoning steps. Finally, verifier-guided reinforcement learning (VGRL) with group-relative policy optimization mitigates lexical overfitting to trigger tokens by focusing on logical integrity auditing. Experiments demonstrate that a compact 4B-parameter verifier achieves forensic accuracy on par with models over 100 times larger against unseen attacks—including latent backdoors and post-hoc rationalization—and maintains robustness under gray-box adaptive attacks.
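The core of the VGRL stage is group-relative policy optimization: each verification rollout is scored not by a learned value baseline but against the mean and spread of its own sampled group. A minimal sketch of that advantage computation follows; the binary reward scheme (1 if the verifier localizes the fracture, 0 otherwise) is an illustrative assumption, not the paper's actual reward design.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages in the GRPO style: normalize each
    rollout's reward by the mean and std of its own sampled group,
    so no separate value network is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 verifier rollouts on one poisoned trace:
# 1.0 = the rollout correctly flagged the fractured step, 0.0 = it missed.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that catch the logical fracture receive positive advantage and are reinforced; rollouts that wave the trace through are pushed down, which is what discourages the verifier from leaning on lexical trigger cues alone.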

📝 Abstract
The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), to instill a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), utilizing Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment (lexical overfitting), whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks (including latent backdoors and post-hoc rationalizations) that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.
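The Automated Forensic Synthesis phase pairs a valid reasoning chain with a minimally corrupted copy so the verifier can learn to pinpoint exactly where the logic breaks. A sketch of what such a contrastive training example might look like follows; the schema and field names are assumptions for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """One illustrative example for contrastive reasoning synthesis:
    a clean trace and a fractured trace that diverge at a single step,
    with the fracture position given as a supervision signal."""
    question: str
    clean_steps: list[str]      # logically valid chain of thought
    fractured_steps: list[str]  # same chain with one fallacious step
    fracture_index: int         # index of the step that breaks the logic


pair = ContrastivePair(
    question="If all A are B and x is an A, is x a B?",
    clean_steps=["All A are B.", "x is an A.", "Therefore x is a B."],
    # Affirming the consequent: the conclusion is asserted in reverse.
    fractured_steps=["All A are B.", "x is a B.", "Therefore x is an A."],
    fracture_index=1,
)
```

Because the two traces share every token outside the fractured span, training on such pairs pressures the verifier toward step-level logical auditing rather than surface pattern matching, which is the behavior the SSFT stage then reinforces.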
Problem

Research questions and friction points this paper is trying to address.

reasoning backdoors
Chain-of-Thought
Large Language Models
logical integrity
adversarial attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning backdoors
Chain-of-Thought verification
step-aware fine-tuning
verifier-guided reinforcement learning
lexical overfitting mitigation
🔎 Similar Papers
2024-08-01 · arXiv.org · Citations: 20
2024-07-01 · Conference on Empirical Methods in Natural Language Processing · Citations: 2