🤖 AI Summary
Post-training techniques (e.g., instruction tuning, reinforcement learning from human feedback (RLHF)) reduce the output entropy of language models, severely degrading the performance of conventional watermark detection. To address this, we propose a hybrid detection framework that integrates watermark and non-watermark detectors. Our method introduces multiple fusion mechanisms to jointly model watermark-specific statistical features alongside the linguistic and distributional characteristics of generated text, thereby overcoming the discriminative limitations of single-modality watermarking methods in low-entropy regimes. We conduct systematic evaluations across diverse models (Llama, Qwen, Phi), prompts, and post-training intensities. Results show that our approach achieves an average 12.7% F1-score improvement over the best single-detector baseline, significantly enhancing robust identification of outputs from alignment-optimized large language models. This work provides a scalable, practical solution for provenance tracking of AI-generated content (AIGC).
📝 Abstract
Watermarking has recently emerged as an effective strategy for detecting text generated by large language models (LLMs). The strength of a watermark typically depends on the entropy afforded by the language model and the set of input prompts. In practice, however, entropy can be quite limited, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector alone under a wide range of experimental conditions.
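To make the hybrid idea concrete, here is a minimal sketch of one possible fusion scheme: a watermark detector's z-score (in the style of green-list watermarking) is combined with a non-watermark detector's score via a weighted soft vote. All function names, weights, and thresholds below are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of detector fusion. The weight `w` and the
# sigmoid calibration are assumptions for illustration only.
import math

def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.5) -> float:
    """z-statistic for the observed fraction of 'green-list' tokens,
    under the null hypothesis that each token is green with probability gamma."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fused_score(z_wm: float, logit_nonwm: float, w: float = 0.6) -> float:
    """Convex combination of the two detectors' calibrated probabilities.
    `logit_nonwm` stands in for any non-watermark detector's raw score,
    e.g. the logit of a perplexity-based classifier."""
    p_wm = sigmoid(z_wm)          # watermark evidence -> probability
    p_nonwm = sigmoid(logit_nonwm)  # non-watermark evidence -> probability
    return w * p_wm + (1 - w) * p_nonwm

def is_machine_generated(z_wm: float, logit_nonwm: float, threshold: float = 0.5) -> bool:
    return fused_score(z_wm, logit_nonwm) > threshold
```

The key property this sketch illustrates is complementarity: when post-training leaves little entropy and the watermark z-score is near zero, a strong non-watermark score can still push the fused decision over the threshold, and vice versa.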