🤖 AI Summary
This work addresses the challenge of watermarking black-box large language models (LLMs) accessed solely via API. We propose the first provably secure, API-level watermarking scheme that embeds and detects watermarks using only standard text sampling, without requiring access to internal token probability distributions. Methodologically, we introduce an implicit distribution-manipulation framework, a key-driven token-biasing strategy, and a progressive hypothesis-testing detection mechanism, achieving zero output distortion and supporting chaining and nesting with multiple secret keys. Evaluated on mainstream LLM APIs, including GPT-4, Claude, and Llama, the scheme achieves >99% detection accuracy and a <0.1% false-positive rate with no degradation in text quality or diversity; in certain settings it even outperforms white-box watermarking baselines. To our knowledge, this is the first watermarking approach for black-box LLMs that provides rigorous theoretical security guarantees while remaining practical and readily deployable.
📝 Abstract
Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not available to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
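To make the hypothesis-testing flavor of watermark detection concrete, here is a minimal, hypothetical sketch (not the paper's actual scheme): a secret key pseudorandomly marks a fraction `gamma` of token bigrams "green", and the detector computes a z-score against the null hypothesis that unwatermarked text hits green bigrams only at the base rate. The function names, the SHA-256-based marking rule, and the parameter `gamma` are illustrative assumptions, not from the source.

```python
import hashlib
import math

def is_green(token: str, prev_token: str, key: str, gamma: float = 0.5) -> bool:
    """Illustrative marking rule: pseudorandomly label a fraction `gamma` of
    (prev_token, token) bigrams 'green', seeded by the secret key.
    This is a common construction in the watermarking literature,
    NOT the specific scheme proposed in this paper."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma

def detect(tokens: list[str], key: str, gamma: float = 0.5) -> float:
    """Return a z-score: under the null hypothesis (unwatermarked text), the
    green-bigram count follows Binomial(n, gamma), so a large z-score is
    evidence that the watermark is present."""
    n = len(tokens) - 1
    greens = sum(is_green(t, p, key) for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A text biased toward green bigrams under the correct key yields a z-score near `sqrt(n)`, while scoring it with the wrong key (or scoring ordinary text) yields a z-score near zero, which is what drives the low false-positive rate of such tests.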