🤖 AI Summary
This work addresses content poisoning attacks against black-box retrieval-augmented generation (RAG) question-answering systems. We propose RIPRAG, the first attack framework that operates without access to internal system components—relying solely on final output feedback. RIPRAG employs end-to-end reinforcement learning to optimize the generation of malicious documents that steer large language models (LLMs) toward attacker-preferred responses. Its key contributions are: (1) the first effective poisoning attack against multi-stage RAG systems under a fully black-box setting—where both the retrieval mechanism and RAG architecture are unknown; and (2) an adaptive attack paradigm guided by sparse success signals, eliminating reliance on gradients or intermediate outputs. Experiments across diverse, complex RAG systems demonstrate that RIPRAG achieves up to 0.72 higher attack success rate than state-of-the-art baselines, revealing critical vulnerabilities in existing defenses.
📝 Abstract
Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become a core technology for tasks such as question-answering (QA) and content generation. However, by injecting poisoned documents into the database of RAG systems, attackers can manipulate LLMs to generate text that aligns with their intended preferences. Existing research has primarily focused on white-box attacks against simplified RAG architectures. In this paper, we investigate a more complex and realistic scenario: the attacker lacks knowledge of the RAG system's internal composition and implementation details, and the RAG system comprises components beyond a mere retriever. Specifically, we propose the RIPRAG attack framework, an end-to-end attack pipeline that treats the target RAG system as a black box, where the only information accessible to the attacker is whether the poisoning succeeds. Our method leverages Reinforcement Learning (RL) to optimize the generation model for poisoned documents, ensuring that the generated poisoned document aligns with the target RAG system's preferences. Experimental results demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems, achieving an attack success rate (ASR) improvement of up to 0.72 compared to baseline methods. This highlights prevalent deficiencies in current defensive methods and provides critical insights for LLM security research.