🤖 AI Summary
This work introduces the first imperceptible adversarial attack targeting the retrieval-to-generation pipeline of black-box retrieval-augmented generation (RAG) systems. To manipulate RAG outputs while remaining undetectable to humans, the authors propose ReGENT, a reinforcement learning framework that jointly optimizes three objectives: retrieval relevance, generation misleadingness, and textual naturalness. Under black-box constraints, ReGENT tracks interactions between the attacker and the target RAG system and continuously refines its perturbation strategy from these combined rewards. Evaluated on newly constructed factual and non-factual question-answering benchmarks, ReGENT achieves significantly higher attack success rates than prior methods across mainstream RAG systems while making only minimal text perturbations (average character-level modification rate below 0.8%). Crucially, the perturbed inputs remain natural and readable, keeping the attack stealthy.
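The three objectives above must be folded into a single scalar reward for the policy update. A minimal sketch of such a combination is below; the weighted-sum form, the weight values, and the assumption that each signal lies in [0, 1] are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical three-part attack reward. The weights (w_rel, w_mis, w_nat)
# and the [0, 1] scaling of each signal are illustrative assumptions.

def combined_reward(relevance: float, misleadingness: float,
                    naturalness: float,
                    w_rel: float = 0.4, w_mis: float = 0.4,
                    w_nat: float = 0.2) -> float:
    """Weighted sum of retrieval relevance, generation misleadingness,
    and textual naturalness, each assumed to lie in [0, 1]."""
    return w_rel * relevance + w_mis * misleadingness + w_nat * naturalness

# A perturbation that retrieves well and misleads while staying natural
# scores higher than one that sacrifices naturalness:
score = combined_reward(0.9, 0.8, 0.95)
```

A scalar reward like this lets a standard policy-gradient update trade off the three objectives; the naturalness term is what penalizes perturbations large enough for a human to notice.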
📝 Abstract
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. We focus on generating human-imperceptible adversarial examples and introduce a novel imperceptible retrieve-to-generate attack against RAG. The task is to find imperceptible perturbations that cause a target document, originally excluded from the initial top-$k$ candidate set, to be retrieved and thereby influence the final answer generation. To address this task, we propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG system and continuously refines attack strategies based on relevance-generation-naturalness rewards. Experiments on newly constructed factual and non-factual question-answering benchmarks demonstrate that ReGENT significantly outperforms existing attack methods in misleading RAG systems using only small, imperceptible text perturbations.
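The attacker–RAG interaction loop the abstract describes can be sketched with a greedy hill-climbing stand-in for the RL policy: query the black-box retriever, keep a small perturbation only if it raises the target document's score, and stop once the target enters the top-$k$. The word-overlap retriever, the toy corpus, and the one-word edit step below are all illustrative assumptions, not the paper's actual components.

```python
import random

def retrieval_score(query: str, doc: str) -> float:
    """Toy black-box retriever: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def top_k(query: str, corpus: list[str], k: int) -> list[str]:
    """Rank the corpus by retrieval score and keep the k best documents."""
    return sorted(corpus, key=lambda d: retrieval_score(query, d),
                  reverse=True)[:k]

def attack(query: str, target: str, corpus: list[str], k: int,
           steps: int = 50, seed: int = 0) -> str:
    """Greedy stand-in for ReGENT's RL loop: apply a small edit, query the
    retriever, and keep the edit only if the target's score improves."""
    rng = random.Random(seed)
    q_words = query.lower().split()
    doc = target
    for _ in range(steps):
        if doc in top_k(query, corpus + [doc], k):
            break  # target document now appears in the top-k results
        candidate = doc + " " + rng.choice(q_words)  # one-word perturbation
        if retrieval_score(query, candidate) > retrieval_score(query, doc):
            doc = candidate  # accept the edit; reject otherwise
    return doc

query = "who discovered penicillin"
corpus = ["alexander fleming discovered penicillin in 1928"]
adv = attack(query, "howard florey mass produced the drug", corpus, k=1)
```

The real attack replaces the greedy accept/reject rule with a learned policy and combines retrieval relevance with generation-misleading and naturalness rewards, so perturbations stay small enough to be imperceptible rather than simply appending query terms.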