Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language model (LLM) watermarks to semantic adversarial attacks, such as malicious injection of toxic content or sentiment reversal, which degrade watermark integrity and harm model providers' reputations. The authors propose a post-hoc, semantics-aware watermarking method that operates without modifying the LLM architecture. Its core innovation is the first integration of contrastive representation learning into watermark design, enabling automatic construction of semantic mapping models that generate green/red token sets highly sensitive to semantic corruption yet robust to meaning-preserving edits. Watermark embedding is performed via semantic-constrained autoregressive editing applied to generated text. Experiments on two benchmark datasets demonstrate that the method significantly resists toxic content insertion and sentiment reversal attacks while maintaining high watermark detection accuracy and strong resilience against removal attempts.
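The contrastive training objective described above can be sketched as an InfoNCE-style loss: the semantic mapping model's embedding of a meaning-preserving paraphrase (positive) is pulled toward the anchor text's embedding, while embeddings of semantic-distorting edits (negatives) are pushed away. This is a minimal illustration of the general technique, not the paper's implementation; `temperature` and the plain cosine similarity are assumptions.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style objective (illustrative sketch): minimize
    -log p(positive | anchor), so semantic-preserving edits score
    high similarity and semantic-distorting edits score low."""
    sims = [_cosine(anchor, positive)] + [_cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

When the positive embedding is close to the anchor and the negatives are far, the loss is near zero; swapping the roles drives it up, which is exactly the gradient signal that shapes the green/red token mapping.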

📝 Abstract
Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, autoregressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list, contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at https://github.com/UCSB-NLP-Chang/contrastive-watermark.
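For context on the detectability criterion the abstract mentions, green-red list watermarks are conventionally verified with a one-proportion z-test: under the null hypothesis (unwatermarked text), each token lands in the green set with probability `gamma`, so a large excess of green tokens signals a watermark. The sketch below shows this standard detection statistic (in the style of Kirchenbauer et al.'s scheme), not this paper's exact detector; `gamma` and `z_threshold` values are assumptions.

```python
import math

def detect_watermark(tokens, green_set, gamma=0.5, z_threshold=4.0):
    """Standard green/red-list detection: count tokens in the green set
    and compute a one-proportion z-score against the null hypothesis
    that green tokens occur at rate gamma."""
    n = len(tokens)
    green_hits = sum(1 for t in tokens if t in green_set)
    z = (green_hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    return z, z > z_threshold
```

A spoofing attack exploits the fact that this statistic only counts green tokens: an attacker can distort the text's meaning without lowering `z`, which is why the paper ties the green/red partition to the text's semantics.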
Problem

Research questions and friction points this paper is trying to address.

Defending LLM watermarking against spoofing attacks
Balancing sensitivity to semantic-distorting edits against insensitivity to semantic-preserving edits
Resolving the tension between detecting global semantic shifts and the local, autoregressive nature of most watermarking schemes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive representation learning for watermarking
Semantic-aware watermarking algorithm
Green-red token list generation