🤖 AI Summary
This work addresses a vulnerability of existing provably secure linguistic steganography methods based on autoregressive language models: under token-level active attacks such as insertion, deletion, or substitution, they fail due to error propagation. To overcome this limitation, the paper introduces diffusion language models into linguistic steganography for the first time, proposing a partially parallel generation mechanism combined with a robust embedding-position selection strategy. It further integrates pseudorandom error-correcting codes with a neighborhood-search decoding algorithm to construct a steganographic system that simultaneously guarantees provable security and resilience against token-level tampering. Theoretical analysis and empirical results demonstrate that the proposed approach mitigates tokenization ambiguity while significantly enhancing robustness against such attacks without compromising security.
📝 Abstract
Recent provably secure linguistic steganography (PSLS) methods rely on mainstream autoregressive language models (ARMs) to address a historically challenging task: disguising covert communication as ``innocuous'' natural language communication. However, because ARMs generate text strictly sequentially, any change to the stegotext produced by ARM-based PSLS methods causes severe error propagation, rendering existing methods unusable under active tampering attacks. To address this, we propose a robust, provably secure linguistic steganography method based on diffusion language models (DLMs). Unlike ARMs, DLMs can generate text in a partially parallel manner, allowing us to select robust positions for steganographic embedding that can be combined with error-correcting codes. Furthermore, we introduce error-correction strategies during steganographic extraction, including pseudorandom error correction and neighborhood-search correction. Theoretical proofs and experimental results demonstrate that our method is both secure and robust: it resists tokenization ambiguity in stegotext segmentation and, to some extent, withstands token-level insertion, deletion, and substitution attacks.
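The neighborhood-search correction described above can be illustrated with a toy sketch. All names and the checksum below are hypothetical stand-ins, not the paper's actual algorithm: given a tampered token sequence, the extractor searches every single-token insertion, deletion, and substitution for a candidate that passes the error-correcting check.

```python
# Hypothetical illustration of neighborhood-search correction:
# search the 1-edit neighborhood of a tampered token sequence
# for a candidate that satisfies a redundancy check.

def checksum_ok(tokens, expected):
    # Stand-in for a real error-correcting-code verification.
    return sum(tokens) % 97 == expected

def neighborhood_search(tokens, expected, vocab):
    """Try all single deletions, substitutions, and insertions."""
    if checksum_ok(tokens, expected):
        return tokens
    n = len(tokens)
    # deletions
    for i in range(n):
        cand = tokens[:i] + tokens[i + 1:]
        if checksum_ok(cand, expected):
            return cand
    # substitutions
    for i in range(n):
        for v in vocab:
            cand = tokens[:i] + [v] + tokens[i + 1:]
            if checksum_ok(cand, expected):
                return cand
    # insertions
    for i in range(n + 1):
        for v in vocab:
            cand = tokens[:i] + [v] + tokens[i:]
            if checksum_ok(cand, expected):
                return cand
    return None  # tampering exceeds the 1-edit search radius

original = [12, 5, 40, 7]
expected = sum(original) % 97
tampered = [12, 5, 41, 7]  # one token substituted by an attacker
recovered = neighborhood_search(tampered, expected, vocab=range(50))
```

Because the toy checksum is weak, the search may return a different sequence that also passes the check; with a real error-correcting code carrying enough redundancy, only the original (or a decodable equivalent) survives verification.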