Watermark Smoothing Attacks against Language Models

📅 2024-07-19
🏛️ arXiv.org
📈 Citations: 2
Influential: 0

🤖 AI Summary
This work identifies a critical vulnerability in large language model (LLM) watermarking: an intrinsic coupling between watermark strength and output token confidence. To exploit this, the authors propose a confidence-guided fine-grained smoothing attack paradigm that transcends conventional token-replacement or rewriting approaches. Its core components include logits reweighting for soft smoothing, confidence-adaptive temperature scaling, and multi-watermark robustness alignment optimization. Evaluated across open-source LLMs ranging from 1.3B to 30B parameters, the method evades ten state-of-the-art watermarking schemes, reducing average detection rates by over 76%, while preserving textual fluency (BLEU and perplexity degradation <2%). This is the first systematic study to uncover and quantify the confidence–detectability linkage in LLM watermarking, providing foundational theoretical insights and an empirical benchmark for developing more robust watermark defense mechanisms.

📝 Abstract
Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model's confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from $1.3$B to $30$B parameters and on $10$ different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
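
The link between model confidence and watermark detectability can be illustrated with a toy example. The sketch below assumes a green-list style watermark that adds a bias `delta` to the logits of a "green" subset of tokens; this scheme, the four-token vocabulary, the `green_ids` set, and the value of `delta` are all illustrative assumptions and are not taken from the paper. It compares how much probability mass the bias moves onto green tokens when the model is confident versus uncertain.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def green_mass(logits, green_ids, delta=0.0):
    # Probability mass on green tokens after adding `delta` to their logits.
    shifted = [x + delta if i in green_ids else x for i, x in enumerate(logits)]
    probs = softmax(shifted)
    return sum(probs[i] for i in green_ids)

# Hypothetical setup: 4-token vocabulary, half of it on the green list.
green_ids = {1, 3}
delta = 2.0

confident = [10.0, 0.0, 0.0, 0.0]  # one (red) token dominates
uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform distribution

# How much the watermark bias increases the green-token mass in each case.
shift_confident = green_mass(confident, green_ids, delta) - green_mass(confident, green_ids)
shift_uncertain = green_mass(uncertain, green_ids, delta) - green_mass(uncertain, green_ids)

print(f"shift when confident: {shift_confident:.4f}")
print(f"shift when uncertain: {shift_uncertain:.4f}")
```

Under these assumptions the bias barely changes the confident distribution but moves a large share of mass in the uncertain one, which suggests that watermark traces concentrate at low-confidence positions.
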
Problem

Research questions and friction points this paper is trying to address.

Studying vulnerabilities in AI text watermarking
Introducing Smoothing Attack to remove watermarks
Exposing weaknesses in existing watermarking schemes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Smoothing Attack removes watermarks
Leverages model confidence for attack
Validated on various model sizes
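
As a rough illustration of how model confidence could guide selective smoothing, the sketch below switches between two next-token distributions at each decoding step. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `smooth_step`, the use of the maximum probability as a confidence proxy, the `threshold` value, and the availability of an unwatermarked reference distribution `ref_probs` are all hypothetical.

```python
def smooth_step(wm_probs, ref_probs, threshold=0.6):
    """Pick the distribution to sample from at one decoding step.

    wm_probs:  next-token probabilities from the watermarked model.
    ref_probs: next-token probabilities from an unwatermarked reference
               model (hypothetical; assumed to share the same vocabulary).
    threshold: assumed confidence cutoff, chosen only for illustration.
    """
    # Assumed confidence proxy: probability of the most likely token.
    confidence = max(wm_probs)
    if confidence >= threshold:
        # Confident step: the watermark has little room to change the
        # output, so keep the watermarked model's distribution.
        return wm_probs
    # Uncertain step: the watermark is likely to have steered the choice,
    # so fall back to the reference distribution.
    return ref_probs

# Usage: a confident step keeps the watermarked distribution,
# an uncertain step is replaced by the reference distribution.
kept = smooth_step([0.9, 0.05, 0.05], [0.3, 0.3, 0.4])
replaced = smooth_step([0.4, 0.35, 0.25], [0.3, 0.3, 0.4])
print(kept, replaced)
```

Intervening only at uncertain steps is one way to leave most of the high-confidence text unchanged, which is the kind of quality preservation the summary describes.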