Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the theoretical claim that AI-generated text watermarks are inherently vulnerable to erasure via random perturbations. Method: a human-in-the-loop experimental paradigm combining LLM-based perturbative editing, multi-dimensional automated evaluation (BERTScore, GPT-4 discrimination), and crowdsourced human quality assessment, backed by statistical significance testing, to evaluate random-walk watermark-removal attacks. Contribution/Results: 100% of perturbed texts retain detectable watermarks within 100 editing steps; automated attack success is only 26%, dropping to 10% after human quality filtering. The study empirically exposes violations of two foundational theoretical assumptions, "rapid mixing" and "quality-preserving perturbations", demonstrating that real-world watermark resilience substantially exceeds theoretical predictions and providing empirical grounding for practical watermark deployment.

📝 Abstract
Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide edits). Through large-scale experiments and human-validated assessments, we find mixing is slow: 100% of perturbed texts retain traces of their origin after hundreds of edits, defying rapid mixing. Oracles falter, as state-of-the-art quality detectors misjudge edits (77% accuracy), compounding errors during attacks. Ultimately, attacks underperform: automated walks remove watermarks just 26% of the time -- dropping to 10% under human quality review. These findings challenge the inevitability of watermark removal. Instead, practical barriers -- slow mixing and imperfect quality control -- reveal watermarking to be far more robust than theoretical models suggest. The gap between idealized attacks and real-world feasibility underscores the need for stronger watermarking methods and more realistic attack models.
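Concretely, the attack the abstract analyzes can be sketched as a gated random walk: propose a perturbation, ask a quality oracle whether to keep it, and stop once a detector no longer flags the text. The sketch below is a minimal illustration under assumed interfaces; `perturb`, `quality_ok`, and `watermark_detected` are hypothetical stand-ins, not the paper's components.

```python
def random_walk_attack(text, perturb, quality_ok, watermark_detected, max_steps=100):
    """Iteratively perturb `text`, keeping only edits the quality oracle
    accepts, until the watermark detector no longer fires (or the step
    budget is exhausted). Returns (final_text, steps_used)."""
    for step in range(max_steps):
        if not watermark_detected(text):
            return text, step          # watermark gone: attack succeeded
        candidate = perturb(text)
        if quality_ok(candidate):      # imperfect oracle gates every edit
            text = candidate
    return text, max_steps             # budget exhausted: attack failed

# Toy, fully deterministic stand-ins (not the paper's models):
# the "watermark" is the literal token "w", the oracle is perfect.
marked = "w w w plain"
perturb = lambda t: t.replace("w ", "", 1)   # drop one marker per step
quality_ok = lambda t: True
detected = lambda t: "w" in t
clean, steps = random_walk_attack(marked, perturb, quality_ok, detected)
# clean == "plain", steps == 3
```

In the paper's setting, `perturb` is an LLM editor and the quality oracle is right only about 77% of the time; those two imperfections are what drive the low 26% success rate reported above.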
Problem

Research questions and friction points this paper is trying to address.

Study challenges rapid mixing assumption in watermark removal attacks
Exposes flaws in automated quality oracles during text perturbation
Demonstrates a gap between theoretical watermark-removal guarantees and real-world watermark robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow mixing retains watermark traces after edits
Imperfect quality oracles misguide attack edits
Human review reduces watermark removal success
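The "imperfect quality oracles" point admits a quick back-of-the-envelope check (an illustration assuming independent per-edit judgments, which the paper does not claim): with 77% per-edit accuracy, the probability that every edit in a walk is judged correctly decays geometrically with walk length.

```python
def all_correct_prob(per_edit_accuracy: float, steps: int) -> float:
    """Chance that an oracle with the given per-edit accuracy judges
    every one of `steps` independent edits correctly."""
    return per_edit_accuracy ** steps

p10 = all_correct_prob(0.77, 10)    # ≈ 0.073
p100 = all_correct_prob(0.77, 100)  # vanishingly small (≈ 4e-12)
```

Even short walks are therefore likely to include at least one misjudged edit, which is consistent with the compounding-error behavior the abstract reports.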
Fabrice Y Harel-Canada
University of California, Los Angeles
Boran Erol
University of California, Los Angeles
Connor Choi
University of California, Los Angeles
Jason Liu
Professor, Florida International University
Parallel discrete-event simulation; performance modeling and simulation of networks and computer systems
Gary Jiarui Song
University of California, Los Angeles
Nanyun Peng
University of California, Los Angeles
Amit Sahai
Symantec Chair Professor of Computer Science; Professor of Mathematics (by courtesy), UCLA
Cryptography; Theoretical Computer Science; Computational Complexity; Secure Computation