Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

📅 2025-08-04

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

This work addresses the vulnerability of retrieval-augmented generation (RAG) systems to “knowledge poisoning” attacks—where adversaries inject adversarial texts into knowledge sources to manipulate model outputs. We propose FilterRAG and its multi-layer extension ML-FilterRAG, a two-tier defense framework that exploits discriminative intrinsic properties of adversarial texts, including semantic inconsistency, low retrieval confidence, and poor generation guidance. Leveraging lightweight feature analysis and machine learning–based classification, the framework enables accurate, real-time filtering of poisoned content. Extensive experiments across multiple benchmark datasets demonstrate that our approach reduces attack success rates by over 90% while preserving ≥96% of the original RAG’s task performance—outperforming existing baselines. To the best of our knowledge, this is the first RAG security defense method grounded in explicit modeling of adversarial text properties. It delivers an interpretable, computationally efficient, and deployment-friendly solution for building trustworthy knowledge-enhanced systems.

Technology Category

Application Category

📝 Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is the PoisonedRAG in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we propose a new property to uncover distinct properties to differentiate between adversarial and clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts from clean ones in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrate their effectiveness, with performances close to those of the original RAG systems.

Problem

Research questions and friction points this paper is trying to address.

Defending against knowledge poisoning attacks in RAG systems

Identifying adversarial texts to prevent misleading model outputs

Proposing FilterRAG and ML-FilterRAG as effective defense methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes FilterRAG and ML-FilterRAG defense methods

Identifies distinct properties to detect adversarial texts

Filters adversarial texts to maintain clean knowledge

🔎 Similar Papers

On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains