🤖 AI Summary
This work addresses the challenge of watermarking black-box large language models (LLMs) accessed solely via API. We propose the first provably secure, API-level watermarking scheme that embeds and detects watermarks using only standard text sampling, without requiring access to internal token probability distributions. Methodologically, we introduce an implicit distribution-manipulation framework, a key-driven token-biasing strategy, and a progressive hypothesis-testing detection mechanism, achieving zero output distortion and supporting chaining and nesting with multiple secret keys. Evaluated on mainstream LLM APIs, including GPT-4, Claude, and Llama, the scheme achieves >99% detection accuracy and a <0.1% false-positive rate with no degradation in text quality or diversity; in certain settings it even outperforms white-box watermarking baselines. To our knowledge, this is the first watermarking approach for black-box LLMs that provides rigorous theoretical security guarantees while remaining practical and readily deployable.
📝 Abstract
Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not available to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
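To make the hypothesis-testing flavor of watermark detection concrete, here is a minimal, hypothetical sketch (not the paper's actual scheme): a secret key pseudorandomly marks a fraction `gamma` of token bigrams "green", and the detector computes a z-score against the null hypothesis that unwatermarked text hits green bigrams only at the base rate. The function names, the SHA-256-based marking rule, and the parameter `gamma` are illustrative assumptions, not from the source.

```python
import hashlib
import math

def is_green(token: str, prev_token: str, key: str, gamma: float = 0.5) -> bool:
    """Illustrative marking rule: pseudorandomly label a fraction `gamma` of
    (prev_token, token) bigrams 'green', seeded by the secret key.
    This is a common construction in the watermarking literature,
    NOT the specific scheme proposed in this paper."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def detect(tokens: list[str], key: str, gamma: float = 0.5) -> float:
    """Return a z-score: under the null hypothesis (unwatermarked text), the
    green-bigram count follows Binomial(n, gamma), so a large z-score is
    evidence that the watermark is present."""
    n = len(tokens) - 1
    greens = sum(is_green(t, p, key) for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A text biased toward green bigrams under the correct key yields a z-score near `sqrt(n)`, while scoring it with the wrong key (or scoring ordinary text) yields a z-score near zero, which is what drives the low false-positive rate of such tests.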