A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

📅 2026-02-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) may employ steganography to circumvent regulatory oversight, yet existing detection approaches rely on reference distributions of non-steganographic outputs that are not accessible in practice. To address this limitation, this work proposes a decision-theoretic formal framework for LLM steganography that quantifies the information asymmetry induced by steganographic behaviour without requiring such reference distributions. By modelling the disparity in usable information between agents that can and cannot decode the hidden content, the framework introduces generalised 𝒱-information and the notion of a "steganographic gap". This enables the construction of a practical monitoring mechanism tailored to LLMs. Experimental results demonstrate that the proposed method effectively detects, quantifies, and mitigates steganographic reasoning in large language models.


๐Ÿ“ Abstract
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms, yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For steganographic reasoning in LLMs, no such reference distribution is available, which renders these approaches inapplicable. We propose an alternative, **decision-theoretic view of steganography**. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content of a steganographic signal, and that this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information in an input. We use this to define the **steganographic gap** -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
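The abstract's core idea can be sketched in a few lines: estimate the average downstream utility a signal gives an agent that holds the decoding key versus one that does not, and take the difference. This is a minimal toy illustration, not the paper's implementation; the parity-based hidden code, the agents, and all function names below are hypothetical.

```python
# Toy sketch of the "steganographic gap": the difference in average
# downstream utility between an agent that can decode the hidden content
# of a signal and one that cannot. Illustrative only.

def utility(action: int, target: int) -> float:
    """Reward 1.0 if the agent's action matches the hidden target bit."""
    return 1.0 if action == target else 0.0

def decoding_agent(signal: str) -> int:
    """Agent with the codebook: here, the hidden bit is the length parity."""
    return len(signal) % 2

def naive_agent(signal: str) -> int:
    """Agent without the codebook: falls back to a constant guess."""
    return 0

def steganographic_gap(signals, targets) -> float:
    """Average utility of the decoding agent minus the naive agent."""
    n = len(signals)
    u_dec = sum(utility(decoding_agent(s), t) for s, t in zip(signals, targets)) / n
    u_naive = sum(utility(naive_agent(s), t) for s, t in zip(signals, targets)) / n
    return u_dec - u_naive

# Signals hide a bit in their length parity; targets are those bits.
signals = ["ok", "yes", "no", "maybe"]   # lengths 2, 3, 2, 5 -> bits 0, 1, 0, 1
targets = [0, 1, 0, 1]
gap = steganographic_gap(signals, targets)  # decoder scores 1.0, naive 0.5
```

A gap near zero suggests the signal carries no extra usable information for the key-holder; a large positive gap is the decision-theoretic evidence of steganography the paper's monitor looks for.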
Problem

Research questions and friction points this paper is trying to address.

steganography
large language models
LLM monitoring
information asymmetry
reference distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

decision-theoretic steganography
generalised V-information
steganographic gap
LLM monitoring
usable information asymmetry