🤖 AI Summary
This work identifies the “RAG paradox”: while retrieval-augmented generation (RAG) systems cite their sources to enhance trustworthiness, that very transparency exposes vulnerabilities that black-box attackers can exploit. To address this, we propose the first realistic black-box attack paradigm that requires no internal model access: it automatically identifies the publicly cited documents in RAG outputs, generates and publishes adversarial “poisoned” documents containing misleading information, and thereby corrupts both retrieval and generation. Experiments demonstrate significant degradation in response accuracy and factual consistency across diverse RAG systems. Further, we introduce an intrinsic defense based on retrieval re-ranking, which improves robustness without modifying the underlying LLM or retriever. This is the first systematic study to formally characterize, empirically validate, and mitigate black-box poisoning threats against RAG systems.
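The attack pipeline described above can be sketched in outline. This is an illustrative toy, not the paper's implementation: the citation format, the URL regex, and the template-based document generator are all assumptions (the paper's attack presumably generates poisoned documents with an LLM), and the final publishing step is deliberately omitted.

```python
import re

def extract_cited_sources(rag_answer: str) -> list[str]:
    """Step 1: collect the source URLs a RAG system discloses in its
    answer -- the transparency that the paper calls the 'RAG paradox'.
    Assumes citations appear as plain or markdown-style URLs."""
    return re.findall(r"https?://[^\s\)\]]+", rag_answer)

def make_poisoned_document(claim: str, contradiction: str) -> str:
    """Step 2 (hypothetical template): wrap a misleading contradiction
    in plausible-sounding prose so a retriever matches it to the same
    queries as the original source."""
    return (f"Recent reports confirm that {contradiction}. "
            f"Earlier statements that {claim} are outdated.")

# Step 3 -- publishing the poisoned document on the disclosed source --
# is omitted here; it is the step that requires no internal access.
```

The key point the sketch illustrates is that every input the attacker needs is already public output of the RAG system itself.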
📝 Abstract
With the growing adoption of retrieval-augmented generation (RAG) systems, recent studies have introduced attack methods aimed at degrading their performance. However, these methods rely on unrealistic white-box assumptions, such as attackers having access to a RAG system's internal processes. To address this issue, we introduce a realistic black-box attack scenario based on the RAG paradox, in which RAG systems inadvertently expose vulnerabilities while attempting to enhance trustworthiness. Because RAG systems reference external documents during response generation, our attack targets these sources without requiring internal access. Our approach first identifies the external sources disclosed by the RAG system and then automatically generates poisoned documents containing misinformation crafted to match those sources. Finally, the poisoned documents are published on the disclosed sources, disrupting the RAG system's response-generation process. Both offline and online experiments confirm that this attack significantly reduces RAG performance without requiring internal access. Furthermore, from an insider perspective within the RAG system, we propose a re-ranking method that acts as a fundamental safeguard, offering a minimal line of defense against unforeseen attacks.
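One way a re-ranking safeguard of the kind mentioned above could work is to demote retrieved documents that disagree with the consensus of the other candidates. The sketch below is a minimal illustration under that assumption, not the paper's method: it blends the retriever's score with a consensus score computed from bag-of-words cosine similarity, so a lone poisoned outlier is pushed down the ranking. The `alpha` weight and the similarity measure are illustrative choices.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two documents."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(docs: list[str], scores: list[float], alpha: float = 0.5) -> list[str]:
    """Blend each document's retriever score with its mean similarity
    to the other candidates, then sort. A poisoned document that
    contradicts the rest of the retrieved set gets a low consensus
    score and is demoted even if its retrieval score is high."""
    n = len(docs)
    consensus = []
    for i in range(n):
        sims = [cosine(docs[i], docs[j]) for j in range(n) if j != i]
        consensus.append(sum(sims) / len(sims) if sims else 0.0)
    blended = [alpha * s + (1 - alpha) * c for s, c in zip(scores, consensus)]
    order = sorted(range(n), key=lambda i: blended[i], reverse=True)
    return [docs[i] for i in order]
```

For example, if two retrieved documents agree that Paris is the capital of France and a third, higher-scoring document pushes a fabricated contradiction, the consensus term demotes the outlier to the bottom of the ranking.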