Echoes within the Reasoning: Stealthy and Effective Watermarking via Chain of Thought

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing black-box watermarking methods for large language models with chain-of-thought (CoT) reasoning capabilities often compromise reasoning fidelity and robustness by perturbing outputs or relying on fragile trigger mechanisms. This work proposes BiCoT, a framework that embeds watermarks into the intrinsic geometric structure of reasoning trajectories by aligning high-salience structural anchors to a private signature subspace. To enhance resilience against model stealing and representation drift, BiCoT introduces a Robust Subspace Registration (RSR) mechanism. Combined with Top-n logprob-based black-box verification, sentinel token calibration, and control token regularization, the method preserves semantic fidelity across diverse complex reasoning tasks and enables cross-domain robust watermark detection under fine-tuning, quantization, model-level perturbations, and adaptive attacks.

📝 Abstract

Large Language Models with Chain-of-Thought reasoning capabilities represent valuable intellectual property, yet existing black-box watermarking methods often trade robustness for reasoning fidelity by perturbing final answers or relying on fragile trigger patterns. We propose BiCoT, a watermarking framework that embeds ownership signals into the internal geometry of reasoning traces by aligning high-saliency structural anchors with a private signature subspace while regularizing ordinary control tokens to preserve semantic capacity. This design couples the watermark with reasoning-relevant representations, making removal difficult without disrupting the features that support coherent reasoning. To enable verification under model theft and representation drift, we introduce Robust Subspace Registration (RSR), a Top-n logprob-based black-box verifier that uses sentinel tokens to calibrate systematic shifts in the output distribution. Experiments show that BiCoT preserves reasoning fidelity across diverse complex reasoning tasks while achieving robust detection under fine-tuning, quantization, model-level perturbations, and adaptive output-level attacks across in-domain and out-of-distribution settings.
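
The abstract does not spell out how sentinel-token calibration and subspace-based verification are computed, so the following is only a toy sketch of the general idea and not the paper's RSR procedure. It assumes, hypothetically, that the verifier sees a vector of Top-n log-probabilities, that drift appears as a constant additive offset measurable on sentinel positions, and that the watermark score is the fraction of residual energy lying in a private signature subspace. The function names `calibrate_logprobs` and `signature_score` and all parameters are illustrative inventions.

```python
import numpy as np

def calibrate_logprobs(observed, reference, sentinel_idx):
    """Subtract a constant offset estimated on sentinel positions.

    Assumption (not from the paper): drift is a uniform additive shift
    in log-probability space, and sentinel positions carry no watermark.
    """
    shift = np.mean(observed[sentinel_idx] - reference[sentinel_idx])
    return observed - shift

def signature_score(calibrated, reference, signature_basis):
    """Fraction of residual energy that lies in the signature subspace.

    signature_basis has shape (n, k); its columns span the hypothetical
    private subspace. Returns a value in [0, 1].
    """
    residual = calibrated - reference
    q, _ = np.linalg.qr(signature_basis)   # orthonormalize the basis
    projected = q @ (q.T @ residual)       # orthogonal projection
    total = float(residual @ residual)
    return float(projected @ projected) / total if total > 0 else 0.0

# Toy usage: a watermark along the first axis plus a uniform drift of 0.5.
reference = np.zeros(6)
basis = np.eye(6)[:, :1]                   # one-dimensional signature subspace
observed = reference + 2.0 * basis[:, 0] + 0.5
sentinels = [4, 5]                         # positions assumed watermark-free

raw = signature_score(observed, reference, basis)
calibrated = signature_score(
    calibrate_logprobs(observed, reference, sentinels), reference, basis
)
```

In this toy setting the calibrated score is higher than the raw one because the uniform drift no longer contributes energy outside the signature subspace. A real verifier would also need a decision threshold and a statistical test, which are omitted here.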

Problem

Research questions and friction points this paper is trying to address.

watermarking

Chain-of-Thought

large language models

intellectual property

reasoning fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

watermarking

Chain-of-Thought

representation geometry