AI Summary
This work addresses the limited robustness of existing large language model watermarking methods under output perturbations or post-training modifications such as fine-tuning. To overcome this, the authors propose a reasoning-layer watermarking framework based on Redundant Chain-of-Thought (R-CoT), which embeds watermarks into the model's internal reasoning pathways rather than its output distribution, thereby internalizing them as distinctive reasoning strategies. Leveraging a GRPO-based dual-trajectory optimization mechanism, the method concurrently constructs native and watermarked reasoning paths within a shared parameter space, enabling their synergistic coexistence. Experimental results demonstrate that the approach maintains a true positive watermark detection rate above 95% across diverse post-training scenarios, significantly outperforming current techniques in both effectiveness and robustness.
Abstract
Large language models (LLMs) are widely deployed across many scenarios owing to their reasoning capabilities. To prevent misuse of these models, watermarking is commonly employed to establish ownership. However, most existing watermarking methods rely on superficial modifications to the model's output distribution, leaving the watermark vulnerable to perturbation and removal. To overcome this challenge, this paper introduces a reasoning-layer framework termed Redundant Chain-of-Thought (R-CoT), which embeds watermarks into the reasoning path. A dual-trajectory optimization mechanism based on GRPO enables the native and watermarked reasoning paths to coexist within a shared parameter space, internalizing the watermark as a distinct reasoning policy. The watermark is thus embedded in the model's stable reasoning path, which avoids the watermark failures caused by output-level perturbations. Experimental results show that, compared with existing methods, R-CoT achieves high watermark effectiveness and strong robustness. Under fine-tuning and other post-training operations, the true positive rate (TPR) consistently remains above 95%, exhibiting only marginal degradation.
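For readers unfamiliar with the reported metric, the TPR is the fraction of truly watermarked models (or samples) that a detector flags as watermarked, i.e. TP / (TP + FN). The sketch below only illustrates that arithmetic. The function name `true_positive_rate` and the toy detector outputs are hypothetical and are not taken from the paper, whose detection procedure is not described here.

```python
from typing import Sequence


def true_positive_rate(detections: Sequence[bool], labels: Sequence[bool]) -> float:
    """Return TP / (TP + FN): the share of watermarked items that were detected.

    detections[i] is True if the (hypothetical) detector flagged item i.
    labels[i] is True if item i is actually watermarked.
    """
    if len(detections) != len(labels):
        raise ValueError("detections and labels must have the same length")
    # Keep only the detector decisions made on truly watermarked items.
    on_positives = [d for d, y in zip(detections, labels) if y]
    if not on_positives:
        raise ValueError("no watermarked items to evaluate")
    return sum(on_positives) / len(on_positives)


# Toy data (an assumption for illustration): 20 watermarked items, 5 clean ones.
labels = [True] * 20 + [False] * 5
# The detector misses one watermarked item and flags no clean item.
detections = [True] * 19 + [False] + [False] * 5

tpr = true_positive_rate(detections, labels)
print(f"TPR = {tpr:.2%}")
```

Clean items do not enter the TPR at all, so a high TPR should be read alongside the false positive rate on unwatermarked models.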