Large Language Models as Carriers of Hidden Messages

📅 2024-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the embedding and protection of trigger-based hidden text in large language models (LLMs). Existing steganographic embeddings are vulnerable to reverse extraction and lack robustness. To tackle this, we propose UTF (Unconditional Token Forcing), a novel extraction attack, and UTFC (Unconditional Token Forcing Confusion), a corresponding defense paradigm. UTF enables unconditional, token-level decoding of trigger-embedded hidden text. UTFC achieves strong robustness against UTF and multiple known attacks via token-constrained sampling, output probability modeling, and adversarial fine-tuning, without degrading the model's general capabilities. Experiments demonstrate that UTF achieves near-perfect extraction accuracy (~100%), while UTFC preserves original model performance across mainstream benchmarks (within ±0.5%) and reduces extraction success rates to below 1%.

📝 Abstract
Simple fine-tuning can embed hidden text into large language models (LLMs), which is revealed only when triggered by a specific query. Applications include LLM fingerprinting, where a unique identifier is embedded to verify licensing compliance, and steganography, where the LLM carries hidden messages disclosed through a trigger query. Our work demonstrates that embedding hidden text via fine-tuning, although seemingly secure due to the vast number of potential triggers, is vulnerable to extraction through analysis of the LLM's output decoding process. We introduce an extraction attack called Unconditional Token Forcing (UTF), which iteratively feeds tokens from the LLM's vocabulary to reveal sequences with high token probabilities, indicating hidden text candidates. We also present Unconditional Token Forcing Confusion (UTFC), a defense paradigm that makes hidden text resistant to all known extraction attacks without degrading the general performance of LLMs compared to standard fine-tuning. UTFC has both benign (improving LLM fingerprinting) and malign applications (using LLMs to create covert communication channels).
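The extraction loop described in the abstract can be illustrated with a minimal sketch, assuming a Hugging Face causal LM: each vocabulary token is fed to the model with no preceding context (no trigger query), the continuation is decoded greedily, and continuations whose tokens all carry unusually high probability are kept as hidden-text candidates. The model name, probe length, and probability threshold below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the Unconditional Token Forcing (UTF) idea: probe every
# vocabulary token unconditionally and flag high-confidence greedy continuations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # stand-in for a fine-tuned LLM carrying hidden text
PROB_THRESHOLD = 0.95        # assumed cutoff for "suspiciously confident" tokens
MAX_NEW_TOKENS = 16          # assumed length of each greedy probe

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()

@torch.no_grad()
def probe_token(token_id: int):
    """Start generation from a single token (no trigger) and decode greedily.

    Returns the decoded text and the minimum next-token probability along the
    greedy path; a path that never drops below the threshold is a hidden-text
    candidate, since memorized sequences tend to be decoded with near-certainty."""
    input_ids = torch.tensor([[token_id]], device=device)
    min_prob = 1.0
    for _ in range(MAX_NEW_TOKENS):
        logits = model(input_ids).logits[0, -1]
        probs = torch.softmax(logits.float(), dim=-1)
        next_id = int(torch.argmax(probs))
        min_prob = min(min_prob, float(probs[next_id]))
        input_ids = torch.cat(
            [input_ids, torch.tensor([[next_id]], device=device)], dim=1
        )
    return tokenizer.decode(input_ids[0]), min_prob

# Iterate over the vocabulary and collect high-confidence continuations.
candidates = []
for tid in range(len(tokenizer)):
    text, min_prob = probe_token(tid)
    if min_prob >= PROB_THRESHOLD:
        candidates.append((min_prob, text))

for score, text in sorted(candidates, reverse=True)[:20]:
    print(f"{score:.3f}  {text!r}")
```

Tracking the minimum probability along the greedy path is one simple way to operationalize "sequences with high token probabilities"; the paper's exact candidate-selection criterion may differ.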
Problem

Research questions and friction points this paper is trying to address.

Detecting hidden text in fine-tuned LLMs
Vulnerability of steganography in LLMs to extraction
Defending against hidden text extraction attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning embeds hidden text in LLMs
UTF attack extracts hidden text via tokens
UTFC defends against extraction attacks effectively
🔎 Similar Papers
No similar papers found.
Jakub Hoscilowicz
Samsung Research, Warsaw University of Technology
AI Agents, Multimodal LLM
Paweł Popiołek
Institute of Telecommunications, Warsaw University of Technology
Jan Rudkowski
Institute of Telecommunications, Warsaw University of Technology
Jędrzej Bieniasz
Institute of Telecommunications, Warsaw University of Technology
Artur Janicki
Warsaw University of Technology
speech processing, signal processing, cybersecurity, machine learning