🤖 AI Summary
This work addresses the lack of adaptive attacks in robustness evaluations of large language model (LLM) watermarking. The authors formulate watermark robustness as an objective function and use preference-based optimization to tune paraphrase attacks against a specific watermarking method — modeling an adaptive attacker who knows which scheme is deployed, unlike the non-adaptive attackers assumed in prior evaluations. The resulting adaptively optimized paraphrasers produce fluent, semantically faithful text that evades detection against all surveyed watermarking schemes. Notably, the attacks transfer: tuning against any single watermark suffices to evade unseen watermarks, and the optimization is cost-effective. These findings constitute a systematic empirical demonstration that current LLM watermarking methods are vulnerable to adaptively tuned attacks, underscoring the need to include such attacks in robustness testing.
📝 Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively optimized paraphrasers at https://github.com/nilslukas/ada-wm-evasion.
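The abstract describes formulating watermark robustness as an objective function and tuning an attack paraphraser with preference-based optimization. A minimal sketch of that pipeline is below; it is an illustration under assumptions, not the paper's actual implementation. The function names (`attack_objective`, `build_preference_pairs`, `dpo_loss`), the linear quality/detection trade-off, and the scores are all hypothetical; the paper's true objective and training setup may differ.

```python
import math

def attack_objective(detect_score, quality_score, lam=1.0):
    """Hypothetical attacker objective: reward high text quality and
    penalize a high watermark-detection score (lam trades them off)."""
    return quality_score - lam * detect_score

def build_preference_pairs(candidates):
    """candidates: list of (text, detect_score, quality_score) for
    several paraphrases of the same watermarked input.
    Ranks candidates by the attack objective and pairs the best one
    against each other candidate as (preferred, rejected) data."""
    scored = sorted(
        ((attack_objective(d, q), t) for t, d, q in candidates),
        reverse=True,
    )
    best = scored[0][1]
    return [(best, t) for _, t in scored[1:]]

def dpo_loss(logp_pref, logp_rej, ref_pref, ref_rej, beta=0.1):
    """Standard DPO-style loss on one preference pair: pushes the tuned
    paraphraser to assign higher likelihood to the preferred paraphrase
    than the rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_pref - ref_pref) - (logp_rej - ref_rej))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In this sketch, paraphrases that evade the detector while preserving quality become the "preferred" examples, and a DPO-style loss over such pairs tunes the paraphraser — which is one concrete way to realize "preference-based optimization" of an adaptive attack.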