🤖 AI Summary
This work identifies a cross-model watermark collision problem in logit-based watermarking for large language models (LLMs): benign, watermark-free texts generated by one model may inadvertently trigger the watermark detector of another, causing false positives. The issue is pervasive in downstream tasks such as translation and paraphrasing, and it severely undermines watermark reliability for copyright protection and content provenance. The authors formally define watermark collision as a novel, generic adversarial paradigm and provide a theoretical argument that it universally threatens all logit-based watermarking schemes. Through cross-model embedding/detection experiments and multi-task empirical evaluation across mainstream LLMs, they demonstrate the prevalence and severity of such collisions, which reduce watermark detection accuracy by over 40% on average. Moving beyond conventional targeted attacks, this work establishes a new robustness benchmark and offers a foundational perspective for watermark resilience research in LLMs.
📝 Abstract
The proliferation of content generated by large language models (LLMs) raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread adoption of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks such as paraphrasing or translation. In this paper, we introduce watermark collision as a novel and general philosophy for watermark attacks, aimed at enhancing attack performance on top of any other attack method. We also provide a comprehensive demonstration that watermark collision poses a threat to all logit-based watermark algorithms, impacting not only specific attack scenarios but also downstream applications.
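To make the collision setting concrete, the sketch below shows a minimal KGW-style logit-based watermark detector, which is the general family of schemes the abstract refers to. It is an illustrative toy, not the authors' code: `VOCAB_SIZE`, `GAMMA`, `DELTA`, and the SHA-256 seeding are placeholder choices. The key point is that each watermarking party uses its own secret key, so when text produced under one key is scored by a detector holding a different key, the green-token statistics of the two schemes interact, which is the setting in which collisions arise.

```python
import hashlib
import random

VOCAB_SIZE = 50_000  # placeholder vocabulary size
GAMMA = 0.5          # fraction of the vocabulary in the "green" list
DELTA = 2.0          # logit bias added to green tokens at generation time (embedding side)

def green_list(prev_token: int, key: str) -> set[int]:
    # Seed a PRNG from the previous token and a secret key, then
    # pseudo-randomly partition the vocabulary into green/red tokens.
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def detect_z(tokens: list[int], key: str) -> float:
    # Count tokens that fall in their context's green list and compare
    # against the GAMMA baseline expected for unwatermarked text.
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:])
               if tok in green_list(prev, key))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / (GAMMA * (1 - GAMMA) * n) ** 0.5
```

A text generated with the green list of key A scores a high z-value under key A's detector and a near-zero z-value under an unrelated key; the paper studies what happens when tasks like paraphrasing or translation chain models whose keyed statistics are not independent in this way.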