Large Language Models as Carriers of Hidden Messages

📅 2024-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the embedding and protection of trigger-based hidden text in large language models (LLMs). Existing steganographic embeddings are vulnerable to reverse extraction and lack robustness. To tackle this, we propose UTF (Unconditional Token Forcing), a novel extraction attack, and UTFC (Unconditional Token Forcing Confusion), a corresponding defense paradigm. UTF enables unconditional, token-level decoding of trigger-embedded hidden text. UTFC achieves strong robustness against UTF and multiple known attacks via token-constrained sampling, output probability modeling, and adversarial fine-tuning, without degrading the model's general capabilities. Experiments demonstrate that UTF achieves near-perfect extraction accuracy (~100%), while UTFC preserves original model performance across mainstream benchmarks (within ±0.5%) and reduces extraction success rates to below 1%.

📝 Abstract
Simple fine-tuning can embed hidden text into large language models (LLMs), which is revealed only when triggered by a specific query. Applications include LLM fingerprinting, where a unique identifier is embedded to verify licensing compliance, and steganography, where the LLM carries hidden messages disclosed through a trigger query. Our work demonstrates that embedding hidden text via fine-tuning, although seemingly secure due to the vast number of potential triggers, is vulnerable to extraction through analysis of the LLM's output decoding process. We introduce an extraction attack called Unconditional Token Forcing (UTF), which iteratively feeds tokens from the LLM's vocabulary to reveal sequences with high token probabilities, indicating hidden text candidates. We also present Unconditional Token Forcing Confusion (UTFC), a defense paradigm that makes hidden text resistant to all known extraction attacks without degrading the general performance of LLMs compared to standard fine-tuning. UTFC has both benign (improving LLM fingerprinting) and malign applications (using LLMs to create covert communication channels).
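The extraction loop described in the abstract can be illustrated with a minimal sketch, assuming a Hugging Face causal LM: each vocabulary token is fed to the model with no preceding context (no trigger query), the continuation is decoded greedily, and continuations whose tokens all carry unusually high probability are kept as hidden-text candidates. The model name, probe length, and probability threshold below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the Unconditional Token Forcing (UTF) idea: probe every
# vocabulary token unconditionally and flag high-confidence greedy continuations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # stand-in for a fine-tuned LLM carrying hidden text
PROB_THRESHOLD = 0.95        # assumed cutoff for "suspiciously confident" tokens
MAX_NEW_TOKENS = 16          # assumed length of each greedy probe

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()

@torch.no_grad()
def probe_token(token_id: int):
    """Start generation from a single token (no trigger) and decode greedily.

    Returns the decoded text and the minimum next-token probability along the
    greedy path; a path that never drops below the threshold is a hidden-text
    candidate, since memorized sequences tend to be decoded with near-certainty."""
    input_ids = torch.tensor([[token_id]], device=device)
    min_prob = 1.0
    for _ in range(MAX_NEW_TOKENS):
        logits = model(input_ids).logits[0, -1]
        probs = torch.softmax(logits.float(), dim=-1)
        next_id = int(torch.argmax(probs))
        min_prob = min(min_prob, float(probs[next_id]))
        input_ids = torch.cat(
            [input_ids, torch.tensor([[next_id]], device=device)], dim=1
        )
    return tokenizer.decode(input_ids[0]), min_prob

# Iterate over the vocabulary and collect high-confidence continuations.
candidates = []
for tid in range(len(tokenizer)):
    text, min_prob = probe_token(tid)
    if min_prob >= PROB_THRESHOLD:
        candidates.append((min_prob, text))

for score, text in sorted(candidates, reverse=True)[:20]:
    print(f"{score:.3f}  {text!r}")
```

Tracking the minimum probability along the greedy path is one simple way to operationalize "sequences with high token probabilities"; the paper's exact candidate-selection criterion may differ.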
Problem

Research questions and friction points this paper is trying to address.

Detecting hidden text in fine-tuned LLMs
Vulnerability of steganography in LLMs to extraction
Defending against hidden text extraction attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning embeds hidden text in LLMs
UTF attack extracts hidden text via tokens
UTFC defends against extraction attacks effectively
🔎 Similar Papers
No similar papers found.
Jakub Hoscilowicz
Samsung Research, Warsaw University of Technology
AI Agents, Multimodal LLM
Paweł Popiołek
Institute of Telecommunications, Warsaw University of Technology
Jan Rudkowski
Institute of Telecommunications, Warsaw University of Technology
Jędrzej Bieniasz
Institute of Telecommunications, Warsaw University of Technology
Artur Janicki
Warsaw University of Technology
speech processing, signal processing, cybersecurity, machine learning