🤖 AI Summary
Post-training techniques (e.g., instruction tuning, reinforcement learning from human feedback (RLHF)) reduce the output entropy of language models, severely degrading the performance of conventional watermark detection. To address this, we propose a hybrid detection framework that integrates watermark and non-watermark detectors. Our method introduces multiple fusion mechanisms to jointly model watermark-specific statistical features alongside the linguistic and distributional characteristics of generated text, thereby overcoming the discriminative limitations of single-modality watermarking methods in low-entropy regimes. We conduct systematic evaluations across diverse models (Llama, Qwen, Phi), prompts, and post-training intensities. Results show that our approach achieves an average 12.7% F1-score improvement over the best single-detector baseline, significantly enhancing robust identification of outputs from alignment-optimized large language models. This work provides a scalable, practical solution for provenance tracking of AI-generated content (AIGC).
📝 Abstract
Watermarking has recently emerged as an effective strategy for detecting text generated by large language models (LLMs). The strength of a watermark typically depends on the entropy afforded by the language model and the set of input prompts. In practice, however, entropy can be quite limited, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector alone under a wide range of experimental conditions.
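To make the hybrid idea concrete, here is a minimal sketch of one possible fusion scheme: a watermark detector's z-score (in the style of green-list watermarking) is combined with a non-watermark detector's score via a weighted soft vote. All function names, weights, and thresholds below are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of detector fusion. The weight `w` and the
# sigmoid calibration are assumptions for illustration only.
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """z-statistic for the observed fraction of 'green-list' tokens,
    under the null hypothesis that each token is green with probability gamma."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fused_score(z_wm: float, logit_nonwm: float, w: float = 0.6) -> float:
    """Convex combination of the two detectors' calibrated probabilities.
    `logit_nonwm` stands in for any non-watermark detector's raw score,
    e.g. the logit of a perplexity-based classifier."""
    p_wm = sigmoid(z_wm)          # watermark evidence -> probability
    p_nonwm = sigmoid(logit_nonwm)  # non-watermark evidence -> probability
    return w * p_wm + (1 - w) * p_nonwm

def is_machine_generated(z_wm: float, logit_nonwm: float, threshold: float = 0.5) -> bool:
    return fused_score(z_wm, logit_nonwm) > threshold
```

The key property this sketch illustrates is complementarity: when post-training leaves little entropy and the watermark z-score is near zero, a strong non-watermark score can still push the fused decision over the threshold, and vice versa.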