Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of retrieval-augmented generation (RAG) systems to “knowledge poisoning” attacks—where adversaries inject adversarial texts into knowledge sources to manipulate model outputs. We propose FilterRAG and its multi-layer extension ML-FilterRAG, a two-tier defense framework that exploits discriminative intrinsic properties of adversarial texts, including semantic inconsistency, low retrieval confidence, and poor generation guidance. Leveraging lightweight feature analysis and machine learning–based classification, the framework enables accurate, real-time filtering of poisoned content. Extensive experiments across multiple benchmark datasets demonstrate that our approach reduces attack success rates by over 90% while preserving ≥96% of the original RAG’s task performance—outperforming existing baselines. To the best of our knowledge, this is the first RAG security defense method grounded in explicit modeling of adversarial text properties. It delivers an interpretable, computationally efficient, and deployment-friendly solution for building trustworthy knowledge-enhanced systems.
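The summary names three discriminative properties (semantic inconsistency, low retrieval confidence, poor generation guidance) that feed a lightweight classifier. Below is a minimal sketch of how such a feature-then-classify filter might look. The feature implementations are illustrative stand-ins (e.g., token overlap as a crude proxy for semantic consistency), and the scikit-learn logistic regression is an assumed choice; the paper's actual features and model are not reproduced here.

```python
from dataclasses import dataclass

import numpy as np
from sklearn.linear_model import LogisticRegression


@dataclass
class Passage:
    text: str
    retrieval_score: float  # similarity score reported by the retriever


def token_overlap(a: str, b: str) -> float:
    """Crude lexical stand-in for semantic consistency with the query."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def features(query: str, p: Passage) -> list[float]:
    # Three illustrative features mirroring the properties named above.
    return [
        1.0 - token_overlap(query, p.text),  # semantic inconsistency
        p.retrieval_score,                   # retrieval confidence
        float(len(p.text.split())),          # weak proxy for generation guidance
    ]


def train_filter(query: str, passages: list[Passage], labels: list[int]):
    """Fit a lightweight classifier; labels: 1 = poisoned, 0 = clean."""
    X = np.array([features(query, p) for p in passages])
    return LogisticRegression().fit(X, labels)


def filter_passages(clf, query: str, passages: list[Passage],
                    threshold: float = 0.5) -> list[Passage]:
    """Keep passages the classifier judges unlikely to be poisoned."""
    X = np.array([features(query, p) for p in passages])
    poison_prob = clf.predict_proba(X)[:, 1]
    return [p for p, prob in zip(passages, poison_prob) if prob < threshold]
```

Because the features are cheap to compute per passage, a filter of this shape can run at retrieval time without adding a second LLM call, which is consistent with the summary's claim of real-time filtering.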

📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, in which attackers compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, where injected adversarial texts steer the model toward an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we uncover distinct properties that differentiate adversarial texts from clean ones in the knowledge data source. We then exploit these properties to filter out adversarial texts in the design of our proposed approaches. Evaluation on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.
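Concretely, a defense of this shape sits between retrieval and generation. The sketch below shows that placement, assuming placeholder callables `retrieve`, `is_clean`, and `generate` that stand in for a deployment's own retriever, filter, and language model; none of these names come from the paper.

```python
# Where a FilterRAG-style defense plugs into a standard RAG loop: retrieved
# passages are screened before the generator sees them, so the LLM only
# conditions on texts that pass the filter.
def rag_answer(question: str, retrieve, is_clean, generate, k: int = 5) -> str:
    candidates = retrieve(question, k)                          # top-k texts from the knowledge base
    context = [c for c in candidates if is_clean(question, c)]  # defense step
    prompt = "\n\n".join(context) + "\n\nQuestion: " + question
    return generate(prompt)
```

Keeping the defense at this boundary means the knowledge base itself never has to be re-indexed: poisoned entries may remain stored, but they are dropped from the context window at query time.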
Problem

Research questions and friction points this paper is trying to address.

Defending RAG systems against knowledge poisoning attacks
Identifying adversarial texts before they mislead model outputs
Designing defenses (FilterRAG, ML-FilterRAG) that preserve performance on clean data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the FilterRAG and ML-FilterRAG defense methods
Identifies intrinsic properties that distinguish adversarial texts from clean ones
Filters adversarial texts to keep the knowledge source clean (see the sketch after this list)
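As a rough illustration of the thresholded, classifier-free filtering the name FilterRAG suggests: score each retrieved passage on one discriminative property and drop statistical outliers relative to the retrieved set. The `prop` scoring function and the z-score cutoff are assumptions for illustration, not the paper's specification.

```python
import statistics


def filter_by_property(passages: list[str], prop, z_cut: float = 2.0) -> list[str]:
    """Drop passages whose property score is an extreme outlier.

    `prop` is a hypothetical callable mapping a passage to a scalar score
    for one discriminative property (e.g., semantic inconsistency).
    """
    scores = [prop(p) for p in passages]
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores) or 1.0  # guard against zero spread
    return [p for p, s in zip(passages, scores) if abs(s - mu) / sd <= z_cut]
```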
Kennedy Edemacu
Muni University, University of Arkansas
M2M communication · Privacy and Security · Machine Learning
Vinay M. Shashidhar
Department of Mathematics and Computer Science, Northern Michigan University, Marquette, MI, USA
Micheal Tuape
Department of Software Engineering, Lappeenranta-Lahti University of Technology, Lappeenranta, Finland
Dan Abudu
Graduate School of Information, Yonsei University, Seoul, South Korea
Beakcheol Jang
Energy and Bioproducts Research Institute, Aston University, Birmingham, U.K.
Jong Wook Kim
OpenAI
Music Information Retrieval · Machine Learning