In-Context Watermarks for Large Language Models

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

In real-world scenarios where access to large language model (LLM) decoding internals is infeasible—e.g., academic peer review—AI-generated text becomes difficult to attribute reliably. Method: This paper proposes In-Context Watermarking (ICW), a novel, prompt-only, model-agnostic watermarking paradigm. ICW implicitly embeds detectable watermarks into generated text via instruction following and in-context learning, enabling fine-grained policy customization and covert activation under indirect prompt injection (IPI). It employs multi-granularity encoding and lightweight statistical/classification-based detectors. Contribution/Results: Evaluated across multiple mainstream LLMs, ICW achieves watermark detection rates exceeding 92%, demonstrates strong robustness against deletion and rewriting, and improves performance with increasing model capability. Crucially, ICW enables reliable AI content attribution without requiring access to model weights, APIs, or internal decoding states—marking the first such solution achieving practical, access-free provenance tracing.

Technology Category

Application Category

📝 Abstract

The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.

Problem

Research questions and friction points this paper is trying to address.

Ensuring provenance of AI-generated text without decoding access

Detecting AI-generated content in academic peer review

Developing model-agnostic watermarking via prompt engineering

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Watermarking via prompt engineering

Model-agnostic watermarking without decoding access

Covert triggering through Indirect Prompt Injection

🔎 Similar Papers

No similar papers found.