🤖 AI Summary
This work addresses the vulnerability of retrieval-augmented generation (RAG) systems to “knowledge poisoning” attacks—where adversaries inject adversarial texts into knowledge sources to manipulate model outputs. We propose FilterRAG and its multi-layer extension ML-FilterRAG, a two-tier defense framework that exploits discriminative intrinsic properties of adversarial texts, including semantic inconsistency, low retrieval confidence, and poor generation guidance. Leveraging lightweight feature analysis and machine learning–based classification, the framework enables accurate, real-time filtering of poisoned content. Extensive experiments across multiple benchmark datasets demonstrate that our approach reduces attack success rates by over 90% while preserving ≥96% of the original RAG’s task performance—outperforming existing baselines. To the best of our knowledge, this is the first RAG security defense method grounded in explicit modeling of adversarial text properties. It delivers an interpretable, computationally efficient, and deployment-friendly solution for building trustworthy knowledge-enhanced systems.
📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is the PoisonedRAG in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we propose a new property to uncover distinct properties to differentiate between adversarial and clean texts in the knowledge data source. Next, we employ this property to filter out adversarial texts from clean ones in the design of our proposed approaches. Evaluation of these methods using benchmark datasets demonstrate their effectiveness, with performances close to those of the original RAG systems.