🤖 AI Summary
This work addresses the challenge of detecting and localizing fine-grained post-generation edits—such as human revisions or adversarial tampering—in watermarked large language model (LLM) outputs. To this end, we propose a novel combinatorial watermarking framework that partitions the vocabulary into mutually exclusive subsets and embeds deterministic combinatorial patterns during token generation. We further design lightweight local statistical metrics, enabling, for the first time, fine-grained edit detection and precise localization. By jointly leveraging global and local statistical hypothesis tests, our method achieves high detection accuracy and low Type-I error rates across diverse editing scenarios. Extensive evaluation on open-source LLMs demonstrates superior robustness and traceability compared to state-of-the-art watermarking approaches.
📝 Abstract
Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits made locally to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We pair this combinatorial watermark with a global statistic for detecting its presence, and further design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
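The abstract's mechanism can be illustrated with a toy sketch. Below, the vocabulary is hashed into `K` disjoint subsets, the "deterministic combinatorial pattern" is assumed to be a simple cyclic schedule over subset indices, the global statistic is the overall pattern match rate, and the local statistic is a sliding-window match rate whose drops flag edited regions. All names, the cyclic schedule, the window size, and the threshold are illustrative assumptions, not the paper's actual construction.

```python
import hashlib

K = 4        # number of disjoint vocabulary subsets (assumed parameter)
WINDOW = 10  # sliding-window size for the local statistic (assumed)
THRESH = 0.5 # a window match rate below this flags a possible edit (assumed)

def subset_of(token: str, k: int = K) -> int:
    """Map a token to one of k disjoint subsets via a stable hash."""
    return hashlib.sha256(token.encode()).digest()[0] % k

def expected_subset(pos: int, k: int = K) -> int:
    """Toy stand-in for the deterministic combinatorial pattern: cyclic schedule."""
    return pos % k

def generate_watermarked(length: int, vocab: list[str]) -> list[str]:
    """Toy generator: at each position, emit a token from the expected subset."""
    return [next(t for t in vocab if subset_of(t) == expected_subset(p))
            for p in range(length)]

def global_match_rate(tokens: list[str]) -> float:
    """Global statistic: fraction of tokens that follow the pattern."""
    hits = sum(subset_of(t) == expected_subset(i) for i, t in enumerate(tokens))
    return hits / max(len(tokens), 1)

def localize_edits(tokens: list[str], window: int = WINDOW,
                   thresh: float = THRESH) -> list[int]:
    """Local statistic: flag window start positions with a low match rate."""
    flagged = []
    for start in range(len(tokens) - window + 1):
        hits = sum(subset_of(tokens[start + j]) == expected_subset(start + j)
                   for j in range(window))
        if hits / window < thresh:
            flagged.append(start)
    return flagged
```

In this sketch, an unedited watermarked sequence yields a global match rate of 1.0 and no flagged windows, while replacing a run of tokens with out-of-subset tokens depresses the match rate only in the windows covering that run, which is what enables localization rather than mere detection.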