Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting and localizing fine-grained post-generation edits—such as human revisions or adversarial tampering—in watermarked large language model (LLM) outputs. To this end, we propose a novel combinatorial watermarking framework that partitions the vocabulary into mutually exclusive subsets and embeds deterministic combinatorial patterns during token generation. We further design lightweight local statistical metrics, enabling, for the first time, fine-grained edit detection and precise localization. By jointly leveraging global and local statistical hypothesis tests, our method achieves high detection accuracy and low Type-I error rates across diverse editing scenarios. Extensive evaluation on open-source LLMs demonstrates superior robustness and traceability compared to state-of-the-art watermarking approaches.

📝 Abstract
Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
Problem

Research questions and friction points this paper is trying to address.

Detecting post-generation edits in watermarked LLM outputs
Localizing modifications after human revisions or spoofing attacks
Developing combinatorial watermarking to identify edited AI-generated text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combinatorial watermarking with vocabulary subset patterns
Global statistic for watermark detection
Lightweight local statistics for edit localization
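The ideas above can be illustrated with a minimal sketch. The paper does not specify its pattern or partition, so this toy assumes a hypothetical cyclic scheme: the vocabulary is split into k disjoint subsets by token id modulo k, a watermarked token at position t is drawn from subset t mod k, the global statistic is a z-score on the overall pattern-match rate (match probability 1/k under the unwatermarked null), and the local statistic is a sliding-window match rate whose dips flag possibly edited spans. All names and thresholds here are illustrative, not the paper's.

```python
import math
import random

def subset_of(token_id: int, k: int) -> int:
    # Hypothetical partition: a token belongs to subset (id mod k).
    return token_id % k

def matches_pattern(tokens, k):
    # Assumed combinatorial pattern: position t should hold a token
    # from subset (t mod k). True where the pattern is respected.
    return [subset_of(tok, k) == (t % k) for t, tok in enumerate(tokens)]

def global_z(tokens, k):
    # Global detection statistic: z-score of the match count against
    # the null hypothesis that each token matches with prob 1/k.
    n = len(tokens)
    m = sum(matches_pattern(tokens, k))
    p0 = 1.0 / k
    return (m - n * p0) / math.sqrt(n * p0 * (1 - p0))

def flag_edits(tokens, k, window=16, thresh=0.5):
    # Lightweight local statistic: sliding-window match rate.
    # Windows whose rate drops below `thresh` are flagged as
    # candidate edited regions (returned as window start indices).
    match = matches_pattern(tokens, k)
    flagged = []
    for s in range(len(tokens) - window + 1):
        rate = sum(match[s:s + window]) / window
        if rate < thresh:
            flagged.append(s)
    return flagged
```

As a usage example: generate 200 pattern-respecting tokens, overwrite positions 80-109 with random tokens to simulate a post-generation edit, and observe that the global z-score drops while `flag_edits` localizes windows inside the tampered span.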
Liyan Xie
Assistant Professor, University of Minnesota
Statistical machine learning, online change detection, diffusion models
Muhammad Siddeek
Google
Mohamed Seif
Princeton University
Distributed Computing, Communication Systems, Information Theory, Trustworthy AI, Privacy
Andrea J. Goldsmith
Department of Electrical and Computer Engineering, Princeton University
Mengdi Wang
Department of Electrical and Computer Engineering, Princeton University