Steganography Beyond Space-Time With Chain of Multimodal AI Agents

📅 2025-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI-generated content poses a significant threat to conventional spatial- and temporal-domain steganography by undermining cover integrity and the robustness of embedded messages. Method: This paper proposes a novel cross-modal linguistic-domain steganography paradigm: audiovisual content is first decomposed into semantic text via a multi-agent chain; then, information embedding is performed through controllable probabilistic biasing during token sampling in large language models (LLMs), while simultaneously reconstructing high-fidelity audiovisual outputs. Contribution/Results: To our knowledge, this is the first work to shift steganography to the semantic linguistic layer, bypassing spatial-temporal manipulation entirely. We design an interpretable word-probability-based encoding/decoding mechanism enabling both zero-bit and multi-bit robust transmission. Experiments demonstrate strong resilience against common adversarial attacks, including compression, face-swapping, and voice conversion, while preserving facial identity, speaker identity, and semantic consistency. Statistical tests confirm indistinguishability from natural content.

📝 Abstract
Steganography is the art and science of covert writing, with a broad range of applications interwoven within the realm of cybersecurity. As artificial intelligence continues to evolve, its ability to synthesise realistic content emerges as a threat in the hands of cybercriminals who seek to manipulate and misrepresent the truth. Such synthetic content introduces a non-trivial risk of overwriting the subtle changes made for the purpose of steganography. When the signals in both the spatial and temporal domains are vulnerable to unforeseen overwriting, it calls for reflection on what can remain invariant after all. This study proposes a paradigm in steganography for audiovisual media, where messages are concealed beyond both spatial and temporal domains. A chain of multimodal agents is developed to deconstruct audiovisual content into a cover text, embed a message within the linguistic domain, and then reconstruct the audiovisual content through synchronising both aural and visual modalities with the resultant stego text. The message is encoded by biasing the word sampling process of a language generation model and decoded by analysing the probability distribution of word choices. The accuracy of message transmission is evaluated under both zero-bit and multi-bit capacity settings. Fidelity is assessed through both biometric and semantic similarities, capturing the identities of the recorded face and voice, as well as the core ideas conveyed through the media. Secrecy is examined through statistical comparisons between cover and stego texts. Robustness is tested across various scenarios, including audiovisual compression, face-swapping, voice-cloning and their combinations.
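The encode/decode mechanism the abstract describes (biasing the word sampling process to embed, re-running the model and reading back word ranks to extract) can be sketched as follows. This is a minimal illustration, not the paper's implementation: a deterministic hash-based toy stands in for the language model's next-token distribution, one bit is embedded per token by choosing between the two most probable candidates, and all function names are hypothetical.

```python
import hashlib

def next_token_probs(context, vocab):
    """Toy deterministic stand-in for an LLM: derive a pseudo-probability
    for each vocabulary word from a hash of (context, word).
    Determinism is what lets the receiver reproduce the same ranking."""
    scores = {w: int(hashlib.sha256((context + w).encode()).hexdigest(), 16) % 1000 + 1
              for w in vocab}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

def embed(bits, vocab, context=""):
    """Encode each bit by biasing sampling: pick the rank-0 (most likely)
    or rank-1 (runner-up) candidate according to the bit value."""
    stego = []
    for b in bits:
        probs = next_token_probs(context, vocab)
        ranked = sorted(probs, key=probs.get, reverse=True)
        word = ranked[b]                 # bit 0 -> top word, bit 1 -> second
        stego.append(word)
        context += word + " "            # condition the next step on the choice
    return stego

def extract(stego_tokens, vocab, context=""):
    """Decode by re-running the same model on the same growing context
    and reading each stego word's rank in the distribution."""
    bits = []
    for word in stego_tokens:
        probs = next_token_probs(context, vocab)
        ranked = sorted(probs, key=probs.get, reverse=True)
        bits.append(ranked.index(word))
        context += word + " "
    return bits
```

Because sender and receiver share the model and the conditioning context, the rank of each sampled word is recoverable without transmitting any side information; a multi-bit variant would choose among the top 2^k candidates per token instead of the top two, at the cost of lower text naturalness.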
Problem

Research questions and friction points this paper is trying to address.

Enhancing steganography in audiovisual media
Developing multimodal AI for message concealment
Ensuring accuracy, fidelity, and robustness in steganography
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal AI agents chain
Biased word sampling encoding
Synchronized aural and visual reconstruction