🤖 AI Summary
Large language models (LLMs) may be exploited for covert data exfiltration in sensitive settings, posing a critical security threat. Method: We propose TrojanStego—the first steganographic trojan threat model for LLMs—achieved via malicious fine-tuning that embeds sensitive data (e.g., cryptographic keys) into natural-language outputs without input manipulation or trigger tokens. Our approach introduces a learnable, vocabulary-partitioned steganographic encoding scheme, integrated with supervised fine-tuning and majority-voting decoding. Contribution/Results: Experiments show that TrojanStego achieves 87% bit-accuracy in transmitting 32-bit keys from unseen prompts in single-generation mode, rising to 97% under triple majority voting—while preserving high linguistic quality and human imperceptibility. This work establishes the first systematic taxonomy of steganographic risks in LLMs and empirically validates their practical feasibility.
📝 Abstract
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.