AI Summary
Existing adversarial text generation methods suffer from poor readability, cannot jointly optimize for embedding similarity, and support only direct attacks (e.g., jailbreaking), leaving them inadequate for realistic threats such as RAG poisoning, indirect prompt injection, and filter evasion. This paper proposes Adversarial Decoding, the first framework to unify semantic controllability, embedding-space alignment, and high readability. It achieves this via gradient-guided token-level optimization, embedding projection constraints, joint fine-tuning of internal language model representations, and a controllable decoding search. Experiments demonstrate substantial improvements: attack success rates increase by 37–64% across RAG poisoning, jailbreaking, and filter-bypass tasks, while human-rated readability reaches 92%, significantly outperforming baselines. Crucially, the findings reveal that low perplexity does not imply robustness against detection.
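The controllable decoding search described above can be pictured as a beam search that scores candidate token sequences by their embedding similarity to a target query. The following is a minimal, hedged sketch of that idea only; it is not the paper's implementation. The vocabulary, the bag-of-letters "embedding", and the function names are all illustrative stand-ins for an LLM's token distribution and a real neural encoder.

```python
# Illustrative sketch (NOT the paper's method): beam search over a toy
# vocabulary, scoring each candidate continuation by cosine similarity
# to a target query's embedding. A real attack would draw candidates
# from an LLM and embed text with a neural retrieval encoder.
import math

VOCAB = ["secure", "password", "reset", "account", "login", "help", "click", "here"]

def embed(text):
    # Toy embedding: 26-dim letter-frequency vector (stand-in for a real encoder).
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def adversarial_beam_search(target_query, length=4, beam_width=3):
    """Keep the beam_width candidates most similar to the target embedding."""
    target = embed(target_query)
    beams = [("", 0.0)]
    for _ in range(length):
        candidates = []
        for text, _ in beams:
            for tok in VOCAB:
                new_text = (text + " " + tok).strip()
                candidates.append((new_text, cosine(embed(new_text), target)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

best_text, best_score = adversarial_beam_search("how do I reset my account password")
print(best_text, round(best_score, 3))
```

The key design point this sketch captures is that the objective is imposed at decoding time via the candidate-scoring function, so the same search loop can target retrieval similarity, jailbreak success, or filter evasion by swapping the scorer.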
Abstract
We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for different adversarial objectives. Prior methods either produce easily detectable gibberish or cannot handle objectives that include embedding similarity. In particular, they work only for direct attacks (such as jailbreaking) and cannot produce adversarial text for realistic indirect injection, e.g., documents that (1) are retrieved by RAG systems in response to broad classes of queries and (2) adversarially influence subsequent generation. We also show that fluency (low perplexity) is not sufficient to evade filtering. We measure the effectiveness of adversarial decoding for several objectives, including RAG poisoning, jailbreaking, and evasion of defensive filters, and demonstrate that it outperforms existing methods while producing readable adversarial documents.
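The claim that low perplexity is not sufficient to evade filtering can be illustrated with a toy calculation: a fluent adversarial passage built from common words can score lower perplexity than benign technical prose, so a perplexity threshold alone cannot separate the two. The sketch below uses a made-up unigram model (the frequency table is an assumption, not data from the paper) in place of a real language model.

```python
# Toy illustration: a perplexity threshold is a weak filter, because an
# adversarial instruction made of common words can have LOWER perplexity
# than legitimate technical text. FREQ values are invented for this demo.
import math

FREQ = {"the": 0.07, "to": 0.05, "please": 0.01, "ignore": 0.005,
        "previous": 0.004, "instructions": 0.003, "eigenvalue": 0.0001,
        "decomposition": 0.0001, "of": 0.03, "matrix": 0.0005, "a": 0.04}

def perplexity(tokens):
    # Unigram perplexity: exp of the average negative log-probability.
    logp = sum(math.log(FREQ.get(t, 1e-6)) for t in tokens)
    return math.exp(-logp / len(tokens))

benign = "the eigenvalue decomposition of a matrix".split()
attack = "please ignore the previous instructions".split()

print("benign:", round(perplexity(benign), 1))
print("attack:", round(perplexity(attack), 1))
```

Here the injected instruction scores as *more* fluent than the benign sentence, which is why the paper argues fluency-based filtering must be treated as one adversarial objective among several rather than as a defense.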