RLCracker: Exposing the Vulnerability of LLM Watermarks with Adaptive RL Attacks

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM watermarking evaluations lack adversarial rigor, leading to overestimated security guarantees. This paper exposes severe vulnerabilities in state-of-the-art watermarking schemes under adaptive attacks and proposes RLCracker, a reinforcement learning (RL)-based attack framework that requires no access to the watermark detector. RLCracker jointly optimizes the adversarial context and model parameters to degrade watermark robustness, and the authors introduce the adaptive robustness radius to formally characterize this degradation. Trained on only 100 short samples, a 3B-parameter model equipped with RLCracker achieves a 98.5% watermark removal rate with 0.92 P-SP semantic fidelity, far exceeding GPT-4o's 6.75%. The attack applies broadly across ten mainstream watermarking schemes and five model scales, establishing the first low-resource, detector-agnostic, and highly effective adaptive attack.
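
The summary describes the attack only at a high level. Below is a minimal, heavily simplified sketch of what a detector-free RL paraphrase attack of this kind can look like. Everything concrete here is an assumption for illustration, not the paper's implementation: GPT-2 stands in for the attack policy, plain REINFORCE stands in for the paper's RL algorithm, and the reward is a toy lexical proxy rather than RLCracker's actual reward design.

```python
# Minimal sketch of a detector-free RL paraphrase attack (illustration only).
# Assumptions not taken from the paper: GPT-2 as the attack policy, plain
# REINFORCE as the RL algorithm, and a toy lexical reward as a stand-in for
# RLCracker's actual reward design.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-5)


def reward(original: str, paraphrase: str) -> float:
    """Toy detector-free reward: keep content words (unigram recall) while
    breaking local token order (bigram disruption). A crude proxy, not the
    paper's reward: n-gram watermarks depend on local token context, so
    breaking bigrams while retaining content words is one cheap surrogate."""
    orig_w, para_w = original.lower().split(), paraphrase.lower().split()

    def bigrams(ws):
        return set(zip(ws, ws[1:]))

    keep = len(set(orig_w) & set(para_w)) / max(len(set(orig_w)), 1)
    broken = 1.0 - len(bigrams(orig_w) & bigrams(para_w)) / max(len(bigrams(orig_w)), 1)
    return 0.5 * keep + 0.5 * broken  # assumed trade-off weights


def train_step(watermarked: str) -> float:
    prompt = f"Paraphrase the text.\nText: {watermarked}\nRewrite:"
    inputs = tok(prompt, return_tensors="pt").to(device)
    # Sample a paraphrase from the current policy (generate() runs without
    # gradients, so the continuation is re-scored below).
    out = policy.generate(**inputs, do_sample=True, max_new_tokens=64,
                          pad_token_id=tok.eos_token_id,
                          return_dict_in_generate=True)
    prompt_len = inputs.input_ids.shape[1]
    gen_ids = out.sequences[0, prompt_len:]
    paraphrase = tok.decode(gen_ids, skip_special_tokens=True)
    # Re-score the sampled continuation with gradients enabled: the logit at
    # position i-1 predicts token i, hence the shifted slice.
    logits = policy(out.sequences).logits[0, prompt_len - 1:-1]
    logp = torch.log_softmax(logits, dim=-1).gather(1, gen_ids[:, None]).sum()
    # REINFORCE: push up the log-probability of high-reward paraphrases.
    loss = -reward(watermarked, paraphrase) * logp
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In use, train_step would be iterated over the handful of watermarked samples (about 100, per the summary). The one property mirrored from the paper is that the reward never queries a watermark detector; it depends only on the watermarked input and the generated paraphrase.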

📝 Abstract
Watermarking for Large Language Models (LLMs) has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, obscuring critical vulnerabilities and overstating their security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies watermark resilience against adaptive adversaries. We theoretically prove that optimizing the attack context and model parameters can substantially reduce this radius, making watermarks highly susceptible to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermarks while preserving semantic fidelity. RLCracker requires only limited watermarked examples and zero access to the detector. Despite weak supervision, it empowers a 3B model to achieve 98.5% removal success and an average 0.92 P-SP score on 1,500-token Unigram-marked texts after training on only 100 short samples. This performance dramatically exceeds the 6.75% removal success of GPT-4o and generalizes across five model sizes and ten watermarking schemes. Our results confirm that adaptive attacks are broadly effective and pose a fundamental threat to current watermarking defenses.
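
The abstract names the adaptive robustness radius but does not reproduce its definition. One plausible formalization, consistent with the description above but not necessarily the paper's exact statement: for a watermarked text x, a detector score D with threshold tau, a semantic distance d (e.g., 1 minus P-SP), and an attack model A parameterized by weights theta and context c,

```latex
% Hedged formalization (assumed, not quoted from the paper): the adaptive
% robustness radius is the smallest semantic distortion an adaptive
% attacker needs in order to push the detector score below its threshold.
\[
  r_{\mathrm{adapt}}(x) \;=\; \min_{\theta,\, c}\;
    \Big\{\, d\big(x,\ A_{\theta,c}(x)\big) \;:\; D\big(A_{\theta,c}(x)\big) < \tau \,\Big\}
\]
```

Under this reading, the paper's theoretical claim is that jointly optimizing theta and c shrinks the radius relative to a fixed, non-adaptive paraphraser, which is why adaptive paraphrase attacks become so effective.
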
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM watermark robustness against adaptive adversarial attacks
Developing reinforcement learning methods to remove watermarks while preserving semantics
Demonstrating fundamental vulnerabilities in current AI content watermarking schemes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive RL attack erases watermarks
Optimizes attack context and model parameters
Requires limited examples and zero detector access
Hanbo Huang
Shanghai Jiao Tong University
Yiran Zhang
Shanghai Jiao Tong University
Hao Zheng
Shanghai Jiao Tong University
Xuan Gong
Shanghai Jiao Tong University
Yihan Li
National University of Defense Technology
Lin Liu
National University of Defense Technology
Shiyu Liang
University of Illinois at Urbana-Champaign
Machine Learning · Optimization · Applied Probability