Discovering Spoofing Attempts on Language Model Watermarks

📅 2024-10-03
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of large language model (LLM) watermarks to learning-based spoofing attacks and the lack of reliable post-hoc detection mechanisms. The authors propose the first reliable statistical framework for post-hoc spoofing detection. Departing from front-end hardening strategies, they show that current learning-based spoofing methods consistently induce quantifiable statistical anomalies, specifically deviations in watermark sequence entropy, bias, and token-level probability distributions. Building on these insights, they formulate a rigorous hypothesis-testing methodology grounded in statistical inference. Their detector achieves over 95% spoofing detection accuracy against multiple state-of-the-art attacks, significantly outperforming existing baselines. To facilitate practical adoption, the authors open-source their detection toolkit. The work introduces a general-purpose, empirically validated defense paradigm for trustworthy watermark deployment in LLMs.

📝 Abstract
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat. We make all our code available at https://github.com/eth-sri/watermark-spoofing-detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting forged watermarks in LLM-generated text
Identifying artifacts left by learning-based spoofing methods
Providing statistical tests to reveal watermark spoofing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes statistical method to detect spoofed watermarks
Identifies artifacts in texts from learning-based spoofing
Develops rigorous tests to reveal watermark forgery
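To make the hypothesis-testing idea concrete, here is a minimal illustrative sketch of the kind of one-sided test such detectors build on. This is not the paper's actual detector: the function name `binomial_z_test`, the expected green-token rate `gamma`, and the counts in the example are all hypothetical, and the paper's specific artifact statistics (entropy and bias deviations) are not reproduced here.

```python
# Illustrative sketch only: a generic one-sided z-test on a
# green-token fraction, the kind of building block watermark
# hypothesis tests use. All inputs below are hypothetical.
from math import erf, sqrt

def binomial_z_test(green_count: int, total: int, gamma: float):
    """Test whether the observed green-token fraction exceeds the
    expected rate gamma under the null. Returns (z, p_value)."""
    p_hat = green_count / total
    se = sqrt(gamma * (1 - gamma) / total)  # std. error under the null
    z = (p_hat - gamma) / se
    # One-sided p-value from the standard normal CDF
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Hypothetical example: 140 green tokens out of 200, expected rate 0.5.
z, p = binomial_z_test(140, 200, 0.5)
print(round(z, 2), p < 0.01)  # → 5.66 True
```

A spoofing detector in the spirit of the paper would go one step further: rather than testing for the watermark itself, it would test whether such per-text statistics deviate from the distribution a genuine watermarked model produces, flagging anomalous bias or entropy as a forgery artifact.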
🔎 Similar Papers
2024-06-17 · North American Chapter of the Association for Computational Linguistics · Citations: 2