🤖 AI Summary
This work tackles a largely overlooked challenge in large language model (LLM) watermarking: watermarks that users can detect undermine trust in the service and invite targeted attacks. To address this, we propose a systematic evaluation and enhancement framework. First, we formally define and empirically validate the "imperceptibility" dimension of LLM watermarks, the first rigorous treatment of this property in the literature. Second, we introduce Water-Probe, a black-box watermark detection method that combines carefully designed prompts with statistical analysis of key-induced output biases, achieving high detection accuracy across mainstream watermarking schemes with false positive rates below 1%. Third, we propose Water-Bag, a multi-key dynamic fusion strategy that significantly improves watermark stealth: user-side indistinguishability increases by over 3×, without degrading text quality or inference latency. Together, these components establish an evaluable and extensible paradigm for practical LLM watermark deployment.
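The multi-key idea behind Water-Bag can be illustrated with a minimal sketch. Everything below is a toy assumption for illustration (the 10-token vocabulary, the hypothetical `KEY_BAG`, and a simplistic green-list watermark), not the paper's actual implementation: each generation draws a fresh key from a pool, so repeated identical prompts no longer expose one consistent key-induced bias, while the provider can still detect the watermark by scoring under every key in the pool.

```python
import random

VOCAB = list("abcdefghij")        # toy 10-token vocabulary
KEY_BAG = [101, 202, 303, 404]    # hypothetical pool of watermark keys

def green_list(key: int) -> set:
    # Toy watermark: each key deterministically fixes a "green" half of the vocabulary.
    return set(random.Random(key).sample(VOCAB, len(VOCAB) // 2))

def generate(n: int, key: int) -> list:
    # Toy watermarked generation: pick the key's green tokens 90% of the time.
    green = sorted(green_list(key))
    return [random.choice(green) if random.random() < 0.9 else random.choice(VOCAB)
            for _ in range(n)]

def water_bag_generate(n: int, bag=KEY_BAG) -> list:
    # Water-Bag (sketch): draw a fresh key per request, so repeated identical
    # prompts no longer show one consistent bias to an outside prober.
    return generate(n, random.choice(bag))

def score(tokens, key: int) -> float:
    # Green-token fraction: ~0.5 for unrelated text, ~0.95 under the true key.
    green = green_list(key)
    return sum(t in green for t in tokens) / len(tokens)

def water_bag_detect(tokens, bag=KEY_BAG, threshold=0.75) -> bool:
    # The provider can still detect: score the text under every key in the
    # bag and flag it if the best-matching key clears the threshold.
    return max(score(tokens, k) for k in bag) >= threshold
```

The design trade-off this sketch surfaces is the one the paper exploits: randomizing key selection removes the stable per-key bias a prober could measure, while detection remains possible because the verifier holds the whole key pool.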
📝 Abstract
Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, existing research has not investigated the imperceptibility of watermarking techniques in LLM services. This matters because LLM providers may not want to disclose the presence of watermarks in real-world scenarios: disclosure could reduce users' willingness to use the service and make the watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM. Our key insight is that current watermarked LLMs expose consistent biases under the same watermark key, which manifest as similar differences across prompts when different watermark keys are used. Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts, while Water-Probe exhibits a minimal false positive rate on non-watermarked LLMs. Finally, we argue that the key to enhancing the imperceptibility of watermarked LLMs is to increase the randomness of watermark key selection. Based on this, we introduce the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.
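The "consistent bias under the same watermark key" intuition behind Water-Probe can be sketched with a toy experiment. The vocabulary, the logit bias `delta`, and the KGW-style green-list seeding below are illustrative assumptions, not the paper's actual probing procedure: re-issuing the same prompt many times to a model with a fixed watermark key yields an answer distribution that is consistently skewed toward the same tokens, whereas an unwatermarked model stays close to its baseline distribution.

```python
import hashlib
import math
import random
from collections import Counter

VOCAB = list("abcdefghij")  # toy 10-token vocabulary

def green_list(key: int, prev_token: str) -> set:
    # KGW-style seeding (sketch): hash (key, previous token) to fix the
    # "green" half of the vocabulary that the watermark will favour.
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, len(VOCAB) // 2))

def sample_next(prev_token: str, key=None, delta: float = 4.0) -> str:
    # Unwatermarked model: uniform over the toy vocabulary.
    # Watermarked model: add a logit bias of delta to the green tokens.
    green = green_list(key, prev_token) if key is not None else set()
    weights = [math.exp(delta if t in green else 0.0) for t in VOCAB]
    return random.choices(VOCAB, weights=weights)[0]

def probe_bias(key, prompt_token="a", n=2000) -> float:
    # Water-Probe intuition (sketch): re-issue the same prompt n times and
    # measure the total-variation distance of the answers from uniform.
    counts = Counter(sample_next(prompt_token, key) for _ in range(n))
    return 0.5 * sum(abs(counts[t] / n - 1 / len(VOCAB)) for t in VOCAB)
```

Under a fixed key, `probe_bias` returns a large, stable skew because the green list for a given prompt never changes between queries; without a key, only sampling noise remains. A prober needs no access to keys or logits, which is what makes this a black-box identification attack.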