ArcMark: Multi-bit LLM Watermark via Optimal Transport

📅 2026-02-06
🤖 AI Summary
This work addresses the lack of an information-theoretic foundation for multi-bit watermarking of large language models. Existing methods largely carry over zero-bit design principles, such as encoding a single bit per token, and therefore do not maximize the number of bits embedded without perturbing average next-token predictions. By formulating watermark embedding as a channel coding problem, the authors derive the first characterization of the information-theoretic capacity of multi-bit watermarking. Guided by this result, they propose ArcMark, a construction based on coding-theoretic principles, optimal transport, and probability distribution modulation that, under certain assumptions, achieves this capacity while leaving average next-token predictions unchanged. In experiments, ArcMark outperforms competing multi-bit watermarks in both bits per token and detection accuracy, supporting the view that LM watermarking is fundamentally a channel coding problem.

📝 Abstract
Watermarking is an important tool for promoting the responsible use of language models (LMs). Existing watermarks insert a signal into generated tokens that either flags LM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent multi-bit watermarks insert several bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. Notably, the information-theoretic capacity of multi-bit watermarking -- the maximum number of bits per token that can be inserted and detected without changing average next-token predictions -- has remained unknown. We address this gap by deriving the first capacity characterization of multi-bit watermarks. Our results inform the design of ArcMark: a new watermark construction based on coding-theoretic principles that, under certain assumptions, achieves the capacity of the multi-bit watermark channel. In practice, ArcMark outperforms competing multi-bit watermarks in terms of bit rate per token and detection accuracy. Our work demonstrates that LM watermarking is fundamentally a channel coding problem, paving the way for principled coding-theoretic approaches to watermark design.
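
The constraint that a watermark must not change average next-token predictions can be made concrete with a toy example. The sketch below is not ArcMark's construction; it is a minimal illustration that assumes a generic inverse-CDF sampler driven by a shared random number `u`, with a hypothetical message-dependent cyclic shift. The names `sample_token`, `token_frequencies`, `num_messages`, and `grid` are illustrative assumptions and do not come from the paper. Because the shifted value is still uniform on [0, 1), averaging over `u` gives back the model's distribution for every message, even though the token chosen for a fixed `u` depends on the message.

```python
import bisect
from itertools import accumulate


def sample_token(probs, u, message, num_messages):
    """Inverse-CDF sampling driven by a message-shifted shared random number.

    Toy scheme for illustration only (an assumption, not the paper's method).
    """
    # Cyclic shift: v remains uniform on [0, 1) whenever u is uniform.
    v = (u + message / num_messages) % 1.0
    cdf = list(accumulate(probs))
    # Clamp in case rounding leaves v slightly above the final CDF value.
    return min(bisect.bisect_right(cdf, v), len(probs) - 1)


def token_frequencies(probs, message, num_messages, grid=1000):
    """Average the sampler over an evenly spaced grid of shared values u."""
    counts = [0] * len(probs)
    for i in range(grid):
        u = (i + 0.5) / grid  # midpoint grid approximates uniform u
        counts[sample_token(probs, u, message, num_messages)] += 1
    return [c / grid for c in counts]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
    for m in range(4):
        print(m, token_frequencies(probs, m, num_messages=4))
```

Each printed row should stay close to `probs`, which is the sense in which such a scheme leaves average predictions unperturbed. A detector that knows `u` can still gain information about the message from the emitted token; how many bits per token can be recovered this way is what the capacity result characterizes.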

Problem

Research questions and friction points this paper is trying to address.

multi-bit watermarking
language models
information-theoretic capacity
watermarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-bit watermarking
optimal transport
channel capacity
language model watermarking
coding theory