🤖 AI Summary
This work addresses the challenge of detecting and localizing fine-grained post-generation edits—such as human revisions or adversarial tampering—in watermarked large language model (LLM) outputs. To this end, we propose a novel combinatorial watermarking framework that partitions the vocabulary into mutually exclusive subsets and embeds deterministic combinatorial patterns during token generation. We further design lightweight local statistical metrics, enabling, for the first time, fine-grained edit detection and precise localization. By jointly leveraging global and local statistical hypothesis tests, our method achieves high detection accuracy and low Type-I error rates across diverse editing scenarios. Extensive evaluation on open-source LLMs demonstrates superior robustness and traceability compared to state-of-the-art watermarking approaches.
📝 Abstract
Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits made locally to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We pair this combinatorial watermark with a global statistic for detecting its presence, and further design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
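The abstract's mechanism can be illustrated with a toy sketch. Below, the vocabulary is hashed into `K` disjoint subsets, the "deterministic combinatorial pattern" is assumed to be a simple cyclic schedule over subset indices, the global statistic is the overall pattern match rate, and the local statistic is a sliding-window match rate whose drops flag edited regions. All names, the cyclic schedule, the window size, and the threshold are illustrative assumptions, not the paper's actual construction.

```python
import hashlib

K = 4        # number of disjoint vocabulary subsets (assumed parameter)
WINDOW = 10  # sliding-window size for the local statistic (assumed)
THRESH = 0.5 # a window match rate below this flags a possible edit (assumed)

def subset_of(token: str, k: int = K) -> int:
    """Map a token to one of k disjoint subsets via a stable hash."""
    return hashlib.sha256(token.encode()).digest()[0] % k

def expected_subset(pos: int, k: int = K) -> int:
    """Toy stand-in for the deterministic combinatorial pattern: cyclic schedule."""
    return pos % k

def generate_watermarked(length: int, vocab: list[str]) -> list[str]:
    """Toy generator: at each position, emit a token from the expected subset."""
    return [next(t for t in vocab if subset_of(t) == expected_subset(p))
            for p in range(length)]

def global_match_rate(tokens: list[str]) -> float:
    """Global statistic: fraction of tokens that follow the pattern."""
    hits = sum(subset_of(t) == expected_subset(i) for i, t in enumerate(tokens))
    return hits / max(len(tokens), 1)

def localize_edits(tokens: list[str], window: int = WINDOW,
                   thresh: float = THRESH) -> list[int]:
    """Local statistic: flag window start positions with a low match rate."""
    flagged = []
    for start in range(len(tokens) - window + 1):
        hits = sum(subset_of(tokens[start + j]) == expected_subset(start + j)
                   for j in range(window))
        if hits / window < thresh:
            flagged.append(start)
    return flagged
```

In this sketch, an unedited watermarked sequence yields a global match rate of 1.0 and no flagged windows, while replacing a run of tokens with out-of-subset tokens depresses the match rate only in the windows covering that run, which is what enables localization rather than mere detection.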