TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) may be exploited for covert data exfiltration in sensitive settings, posing a critical security threat. Method: We propose TrojanStego—the first steganographic trojan threat model for LLMs—achieved via malicious fine-tuning that embeds sensitive data (e.g., cryptographic keys) into natural-language outputs without input manipulation or trigger tokens. Our approach introduces a learnable, vocabulary-partitioned steganographic encoding scheme, integrated with supervised fine-tuning and majority-voting decoding. Contribution/Results: Experiments show that TrojanStego achieves 87% bit-accuracy in transmitting 32-bit keys from unseen prompts in single-generation mode, rising to 97% under triple majority voting—while preserving high linguistic quality and human imperceptibility. This work establishes the first systematic taxonomy of steganographic risks in LLMs and empirically validates their practical feasibility.
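The core encoding idea can be sketched in a few lines. The snippet below is a hypothetical illustration of vocabulary partitioning, not the paper's exact construction: it assumes the vocabulary is split into two halves (here by simple token-id parity), so that the partition of each generated token carries one secret bit, and a receiver who knows the partition can read the bits back off the token stream.

```python
# Hypothetical sketch of vocabulary-partitioned steganographic decoding.
# The parity-based partition and all names here are illustrative
# assumptions; the paper's learned scheme may differ.

def partition_bit(token_id: int) -> int:
    """Map a token id to partition 0 or 1 (here: parity of the id)."""
    return token_id % 2

def decode_bits(token_ids, n_bits):
    """Recover the first n_bits secret bits from a generated sequence."""
    return [partition_bit(t) for t in token_ids[:n_bits]]

# A compromised model would be fine-tuned so that its i-th output token
# falls in the partition matching secret bit i, while staying fluent.
secret = [1, 0, 1, 1]
token_ids = [101, 42, 7, 13]  # ids whose parities match the secret bits
assert decode_bits(token_ids, len(secret)) == secret
```

The attack's covertness comes from the fact that each partition still contains roughly half the vocabulary, so the model retains enough lexical freedom to produce natural-sounding text at every step.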

📝 Abstract
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
Problem

Research questions and friction points this paper is trying to address.

LLMs can leak confidential data via steganography
Adversaries embed sensitive info in natural outputs
Stego attacks are covert, practical, and dangerous
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes LLM for steganographic data embedding
Uses vocabulary partitioning for secret encoding
Achieves high accuracy with majority voting
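The majority-voting step in the last bullet can be sketched as follows, assuming a bit-wise vote over independently decoded generations (function names are illustrative, not the paper's code):

```python
from collections import Counter

def majority_vote(bitstreams):
    """Combine per-generation decoded bitstreams bit-by-bit by majority.

    bitstreams: list of equal-length lists of 0/1 bits, one per generation.
    """
    n = len(bitstreams[0])
    return [Counter(bs[i] for bs in bitstreams).most_common(1)[0][0]
            for i in range(n)]

# Three decodings of an 8-bit secret, each with one flipped bit; the
# per-bit majority recovers the true secret.
gen1 = [1, 0, 1, 1, 0, 0, 1, 0]
gen2 = [1, 0, 1, 1, 0, 1, 1, 0]
gen3 = [1, 1, 1, 1, 0, 0, 1, 0]
assert majority_vote([gen1, gen2, gen3]) == [1, 0, 1, 1, 0, 0, 1, 0]
```

This illustrates why accuracy rises from 87% per generation to over 97% with three generations: an independent per-bit error rate of roughly 13% requires at least two of three decodings to err on the same bit before the vote is wrong.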