Robust Detection of Watermarks for Large Language Models Under Human Edits

📅 2024-11-21
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Human editing dilutes the watermark signals embedded in LLM-generated text, sharply degrading the performance of existing detection methods. To address this, the paper proposes Tr-GoF (Truncated Goodness-of-Fit), a watermark detection method that models human edits via mixture-model detection. Tr-GoF requires no prior knowledge of the edit level and remains effective under heavy editing and weak watermark signals by combining Gumbel-max watermark modeling with an adaptive truncation strategy, overcoming the poor noise resilience inherent in conventional sum-based detectors. Theoretical analysis establishes its asymptotic optimality. Experiments on synthetic data and mainstream open-source models (e.g., OPT, LLaMA) show that Tr-GoF matches and sometimes exceeds the detection power of state-of-the-art methods under high human-editing ratios.

📝 Abstract
Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality adaptively as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman-Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.
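The mixture-model intuition in the abstract can be illustrated with a small synthetic sketch. The snippet below uses a simplified higher-criticism-style truncated statistic as a stand-in for the paper's Tr-GoF test (the actual test optimizes over a family of goodness-of-fit divergences, which is not reproduced here); the 80% edit fraction, the Beta(1, 8) alternative for tokens whose watermark survives, and the truncation fraction are all illustrative assumptions. Under the null (fully human text), token-level pivotal p-values are Uniform(0, 1); under editing, only a fraction of tokens retain the watermark signal.

```python
import numpy as np

def truncated_gof(pvals, trunc_frac=0.5):
    """Truncated higher-criticism-style goodness-of-fit statistic (a
    simplified illustration, not the paper's exact Tr-GoF test).

    Only the smallest trunc_frac of the sorted p-values enter the
    maximization, so a sparse surviving signal is not averaged away by
    the edited (uniform) bulk -- the intuition behind truncation.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    k = np.arange(1, n + 1)
    # Standardized deviation of the empirical CDF from Uniform(0, 1).
    hc = np.sqrt(n) * (k / n - p) / np.sqrt(p * (1.0 - p) + 1e-12)
    # Guard against unstable terms from extremely small p-values.
    hc[p < 1.0 / n] = -np.inf
    k_max = max(1, int(trunc_frac * n))
    return float(hc[:k_max].max())

rng = np.random.default_rng(0)
n, edit_frac = 500, 0.8            # assume 80% of tokens were edited
human = rng.uniform(size=n)        # H0: pivotal p-values are Uniform(0, 1)
is_edited = rng.uniform(size=n) < edit_frac
watermarked_edited = np.where(
    is_edited,
    rng.uniform(size=n),           # edited tokens behave like human text
    rng.beta(1, 8, size=n),        # surviving tokens: p-values skewed small
)
print(truncated_gof(human), truncated_gof(watermarked_edited))
```

Even with most tokens carrying no signal, the statistic on the edited watermarked sample is typically much larger than on purely human text, so thresholding it separates the two hypotheses; a sum-based statistic instead aggregates over all tokens, letting the uniform bulk swamp the sparse signal.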
Problem

Research questions and friction points this paper is trying to address.

Detect watermarks in LLM text under human edits
Improve robustness against edit-induced signal dilution
Achieve optimal detection without prior edit knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Truncated goodness-of-fit test for watermark detection
Adaptive optimality without edit level knowledge
Superior robustness against human edits