Adversarial Decoding: Generating Readable Documents for Adversarial Objectives

📅 2024-10-03
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Existing adversarial text generation methods either produce poorly readable text or cannot optimize for embedding similarity, and they support only direct attacks (e.g., jailbreaking). This makes them inadequate for realistic threats such as RAG poisoning, indirect prompt injection, and filter evasion. This paper proposes Adversarial Decoding, the first framework unifying semantic controllability, embedding-space alignment, and high readability. It achieves this via gradient-guided token-level optimization, embedding projection constraints, joint fine-tuning of internal language model representations, and controllable decoding search. Experiments demonstrate substantial improvements: attack success rates increase by 37–64% across RAG poisoning, jailbreaking, and filter bypass tasks, while human readability reaches 92%, significantly outperforming baselines. The paper also finds that low perplexity alone does not guarantee that a text evades detection.
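
For readers unfamiliar with the term, perplexity is the exponential of the negative mean per-token log-probability that a language model assigns to a text. The sketch below is only an illustration of that formula: the log-probabilities are made-up values rather than model output, and the helper name `perplexity` is an assumption, not something taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean natural-log probability) over the tokens of a text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities (illustrative values only).
fluent = [math.log(0.5)] * 4      # each token assigned probability 0.5
gibberish = [math.log(0.01)] * 4  # each token assigned probability 0.01

print(perplexity(fluent))     # lower value: the model finds the text predictable
print(perplexity(gibberish))  # higher value: the model finds the text surprising
```

A filter that thresholds this value would flag the second text but not the first. The summary's point is that passing such a fluency check does not by itself mean a text passes other filters.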

📝 Abstract
We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for different adversarial objectives. Prior methods either produce easily detectable gibberish, or cannot handle objectives that include embedding similarity. In particular, they only work for direct attacks (such as jailbreaking) and cannot produce adversarial text for realistic indirect injection, e.g., documents that (1) are retrieved in RAG systems in response to broad classes of queries, and also (2) adversarially influence subsequent generation. We also show that fluency (low perplexity) is not sufficient to evade filtering. We measure the effectiveness of adversarial decoding for different objectives, including RAG poisoning, jailbreaking, and evasion of defensive filters, and demonstrate that it outperforms existing methods while producing readable adversarial documents.
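
To make the embedding-similarity objective concrete, here is a minimal sketch of how a RAG retriever typically ranks documents by cosine similarity to a query embedding, and why a document that is moderately close to many queries can be retrieved for a broad class of them. The vectors, document ids, and helper names (`cosine`, `top_k`, `mean_similarity`) are hypothetical. This is not the paper's algorithm, only an illustration of the retrieval condition the abstract refers to.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, corpus, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def mean_similarity(doc_vec, query_vecs):
    """Average similarity of one document to a whole class of queries."""
    return sum(cosine(doc_vec, q) for q in query_vecs) / len(query_vecs)


# Toy 3-dimensional "embeddings" (hand-picked for illustration).
corpus = {
    "doc_a": [1.0, 0.0, 0.0],  # close to the first query only
    "doc_b": [0.0, 1.0, 0.0],  # close to the second query only
    "doc_c": [0.6, 0.6, 0.5],  # moderately close to both queries
}
queries = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.1]]

for q in queries:
    print(top_k(q, corpus, k=2))

best = max(corpus, key=lambda d: mean_similarity(corpus[d], queries))
print(best)
```

In this toy setup `doc_c` is not the single best match for either query, yet it appears in the top 2 for both and has the highest mean similarity across the query class.
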
Problem

Research questions and friction points this paper is trying to address.

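Prior methods produce easily detectable gibberish rather than readable adversarial text
Prior methods cannot handle objectives that include embedding similarity
Prior methods work only for direct attacks (such as jailbreaking), not realistic indirect injection such as RAG poisoning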
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates readable adversarial documents
Handles embedding similarity objectives
Outperforms existing methods in RAG poisoning