AI Summary
Existing adversarial text generation methods suffer from poor readability, cannot jointly optimize for embedding similarity, and support only direct attacks (e.g., jailbreaking), leaving them inadequate for realistic threats such as RAG poisoning, indirect prompt injection, and filter evasion. This paper proposes Adversarial Decoding, the first framework to unify semantic controllability, embedding-space alignment, and high readability. It achieves this via gradient-guided token-level optimization, embedding projection constraints, joint fine-tuning of internal language model representations, and a controllable decoding search. Experiments demonstrate substantial improvements: attack success rates increase by 37–64% across RAG poisoning, jailbreaking, and filter-bypass tasks, while human-rated readability reaches 92%, significantly outperforming baselines. Crucially, the findings reveal that low perplexity does not imply robustness against detection.
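The controllable decoding search described above can be pictured as a beam search that scores candidate token sequences by their embedding similarity to a target query. The following is a minimal, hedged sketch of that idea only; it is not the paper's implementation. The vocabulary, the bag-of-letters "embedding", and the function names are all illustrative stand-ins for an LLM's token distribution and a real neural encoder.

```python
# Illustrative sketch (NOT the paper's method): beam search over a toy
# vocabulary, scoring each candidate continuation by cosine similarity
# to a target query's embedding. A real attack would draw candidates
# from an LLM and embed text with a neural retrieval encoder.
import math

VOCAB = ["secure", "password", "reset", "account", "login", "help", "click", "here"]

def embed(text):
    # Toy embedding: 26-dim letter-frequency vector (stand-in for a real encoder).
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def adversarial_beam_search(target_query, length=4, beam_width=3):
    """Keep the beam_width candidates most similar to the target embedding."""
    target = embed(target_query)
    beams = [("", 0.0)]
    for _ in range(length):
        candidates = []
        for text, _ in beams:
            for tok in VOCAB:
                new_text = (text + " " + tok).strip()
                candidates.append((new_text, cosine(embed(new_text), target)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

best_text, best_score = adversarial_beam_search("how do I reset my account password")
print(best_text, round(best_score, 3))
```

The key design point this sketch captures is that the objective is imposed at decoding time via the candidate-scoring function, so the same search loop can target retrieval similarity, jailbreak success, or filter evasion by swapping the scorer.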
Abstract
We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for different adversarial objectives. Prior methods either produce easily detectable gibberish or cannot handle objectives that include embedding similarity. In particular, they work only for direct attacks (such as jailbreaking) and cannot produce adversarial text for realistic indirect injection, e.g., documents that (1) are retrieved by RAG systems in response to broad classes of queries and (2) adversarially influence subsequent generation. We also show that fluency (low perplexity) is not sufficient to evade filtering. We measure the effectiveness of adversarial decoding for several objectives, including RAG poisoning, jailbreaking, and evasion of defensive filters, and demonstrate that it outperforms existing methods while producing readable adversarial documents.
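The claim that low perplexity is not sufficient to evade filtering can be illustrated with a toy calculation: a fluent adversarial passage built from common words can score lower perplexity than benign technical prose, so a perplexity threshold alone cannot separate the two. The sketch below uses a made-up unigram model (the frequency table is an assumption, not data from the paper) in place of a real language model.

```python
# Toy illustration: a perplexity threshold is a weak filter, because an
# adversarial instruction made of common words can have LOWER perplexity
# than legitimate technical text. FREQ values are invented for this demo.
import math

FREQ = {"the": 0.07, "to": 0.05, "please": 0.01, "ignore": 0.005,
        "previous": 0.004, "instructions": 0.003, "eigenvalue": 0.0001,
        "decomposition": 0.0001, "of": 0.03, "matrix": 0.0005, "a": 0.04}

def perplexity(tokens):
    # Unigram perplexity: exp of the average negative log-probability.
    logp = sum(math.log(FREQ.get(t, 1e-6)) for t in tokens)
    return math.exp(-logp / len(tokens))

benign = "the eigenvalue decomposition of a matrix".split()
attack = "please ignore the previous instructions".split()

print("benign:", round(perplexity(benign), 1))
print("attack:", round(perplexity(attack), 1))
```

Here the injected instruction scores as *more* fluent than the benign sentence, which is why the paper argues fluency-based filtering must be treated as one adversarial objective among several rather than as a defense.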