🤖 AI Summary
Existing jailbreak attacks struggle to simultaneously conceal malicious intent (toxicity) and preserve linguistic naturalness, leaving them vulnerable to safety detection mechanisms. This paper proposes StegoAttack, the first fully stealthy jailbreak framework to conceal both toxicity and linguistic anomalies at once. It encodes adversarial queries as semantically natural steganographic text, prompting large language models (LLMs) to autonomously decode and execute the hidden malicious instructions while evading both internal alignment safeguards and external content filters. Methodologically, StegoAttack introduces a novel multi-stage paradigm: steganographic encoding, semantics-preserving embedding, instruction-guided payload extraction, and multi-layer adversarial prompt engineering. Evaluated on four safety-aligned LLMs from major vendors, it achieves an average attack success rate (ASR) of 92.0%, outperforming state-of-the-art baselines by 11.0 percentage points. Under external detection (e.g., Llama Guard), the ASR drops by less than 1%, and its overall stealthiness sets a new state of the art.
📄 Abstract
Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the best overall scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth. The code is available at https://anonymous.4open.science/r/StegoAttack-Jail66
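To make the hide-then-extract mechanism concrete, here is a minimal toy sketch of text steganography in the spirit the abstract describes. The paper's actual encoding scheme is not specified here; this hypothetical example uses a simple acrostic-style embedding (each word of the hidden query becomes the first word of a benign cover sentence), which an instructed model could later reverse.

```python
# Toy acrostic-style text steganography: a hypothetical illustration,
# NOT the encoding scheme used by StegoAttack itself.

def embed(hidden_query: str, cover_sentences: list[str]) -> str:
    """Hide each word of the query as the first word of one cover sentence."""
    words = hidden_query.split()
    if len(words) > len(cover_sentences):
        raise ValueError("need at least one cover sentence per hidden word")
    stego = [
        f"{word.capitalize()} {sentence}"
        for word, sentence in zip(words, cover_sentences)
    ]
    return " ".join(stego)

def extract(stego_text: str) -> str:
    """Recover the hidden query by reading the first word of each sentence."""
    sentences = [s.strip() for s in stego_text.split(".") if s.strip()]
    return " ".join(s.split()[0].lower() for s in sentences)
```

Because the cover sentences remain fluent natural language, a surface-level toxicity or anomaly detector sees only benign text; only a decoder that knows the extraction rule recovers the payload.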