🤖 AI Summary
Human editing significantly dilutes the watermark signals embedded in LLM-generated text, causing sharp performance drops in existing detection methods. To address this, we propose Tr-GoF (Truncated Goodness-of-Fit), a robust watermark detection framework built on a mixture-model view of edited text. Tr-GoF requires no prior knowledge of the editing intensity or of the LLM's token probabilities, and it remains effective under heavy editing and weak watermark signals by pairing a Gumbel-max watermark model with an adaptive truncation strategy, overcoming the poor noise resilience inherent in conventional sum-based detectors. Theoretical analysis establishes its asymptotic optimality. Experiments on synthetic data and mainstream open-source models (e.g., OPT, LLaMA) show that Tr-GoF delivers competitive, and sometimes superior, detection power compared to state-of-the-art methods under high human-editing ratios.
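To make the Gumbel-max watermark concrete, the sketch below simulates its pivotal statistic. This is an illustrative toy (the vocabulary size, next-token distributions, and sample counts are all made up, and real schemes derive the uniforms from a keyed hash of preceding tokens): under the null (human text) the recorded statistic is Uniform(0, 1), while under the watermark the token selected by the Gumbel-max rule makes it stochastically larger.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 100, 2000  # toy vocabulary size and token count (illustrative)

def pivotal_statistics(watermarked: bool) -> np.ndarray:
    """Simulate the Gumbel-max watermark's pivotal statistic U[w_t].

    Under H0 (human text), the token choice is independent of the
    key-derived uniforms, so the recorded value is Uniform(0, 1).
    Under H1 (watermarked text), the token maximizing U_w^(1/P_w)
    is selected, so the recorded value skews toward 1.
    """
    stats = np.empty(n)
    for t in range(n):
        P = rng.dirichlet(np.ones(V))       # toy next-token distribution
        U = rng.uniform(size=V)             # stand-in for key-derived uniforms
        if watermarked:
            w = np.argmax(U ** (1.0 / P))   # Gumbel-max sampling rule
        else:
            w = rng.choice(V, p=P)          # ordinary sampling, independent of U
        stats[t] = U[w]
    return stats

null_stats = pivotal_statistics(watermarked=False)
wm_stats = pivotal_statistics(watermarked=True)
print(null_stats.mean(), wm_stats.mean())
```

Human edits replace watermarked tokens with unwatermarked ones, so an edited text yields a *mixture* of the two distributions above; this is exactly the mixture-model detection problem Tr-GoF targets.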
📝 Abstract
Watermarking offers an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits to LLM-generated text dilutes watermark signals, thereby significantly degrading the detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality *adaptively*, as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman–Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.
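A truncated goodness-of-fit statistic can be sketched as follows. This is an illustrative higher-criticism-style variant, not the paper's exact definition (Tr-GoF is defined through a family of divergences): it converts pivotal statistics to p-values, sorts them, and scans only the smallest ones for a deviation from uniformity, which is what keeps it sensitive when edits dilute the watermark to a sparse signal that a plain sum would wash out.

```python
import numpy as np

def truncated_gof_statistic(pvals: np.ndarray, trunc: float = 0.5) -> float:
    """Higher-criticism-style truncated goodness-of-fit statistic (illustrative).

    Compares the empirical CDF of the p-values against the Uniform(0, 1)
    CDF expected under the null, standardized pointwise, but maximizes
    only over p-values below `trunc`: truncation discards the large
    p-values contributed by human-edited tokens.
    """
    n = len(pvals)
    p = np.clip(np.sort(pvals), 1e-10, 1 - 1e-10)
    k = np.arange(1, n + 1)
    z = np.sqrt(n) * (k / n - p) / np.sqrt(p * (1 - p))
    return float(np.max(z[p <= trunc]))

rng = np.random.default_rng(1)
# H0: all p-values uniform (unwatermarked / fully rewritten text).
null_p = rng.uniform(size=5000)
# H1: 90% edited (uniform) tokens mixed with 10% watermarked tokens,
# whose small p-values are modeled here by Beta(0.2, 1) for illustration.
alt_p = np.concatenate([rng.uniform(size=4500), rng.beta(0.2, 1.0, size=500)])
print(truncated_gof_statistic(null_p), truncated_gof_statistic(alt_p))
```

Even though 90% of the p-values in `alt_p` carry no signal, the statistic on the mixture is far larger than on pure noise, while a sum-based score over all tokens would average the sparse signal away.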