🤖 AI Summary
This work addresses a critical vulnerability in retrieval-augmented generation (RAG) systems: their susceptibility to knowledge poisoning attacks. While existing defenses focus solely on detecting contaminated evidence, they overlook the “monitoring-control gap”—the phenomenon where models, despite recognizing contradictions, still incorporate erroneous information into their outputs. To bridge this gap, the authors propose the Cordon principle, reframing RAG defense as an information flow control problem. Their approach employs a multi-agent architecture that isolates evidence extraction, cross-source auditing, and answer synthesis, complemented by an agent isolation mechanism based on asymmetric memory permissions. This design explicitly prohibits components with final answer synthesis capabilities from directly accessing untrusted natural language evidence. Evaluated across five BEIR datasets, the method reduces attack success rates by 92.4% compared to unprotected RAG systems.
📝 Abstract
Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumption is incorrect: models exhibit a monitoring-control gap -- they can detect contradictions in retrieved evidence yet still act on poisoned claims. We introduce the Cordon Principle -- no agent capable of final synthesis may access untrusted natural-language evidence -- and realize it through CORDON-MAS, a compartmentalized framework that enforces this principle architecturally by separating evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges. Across five BEIR datasets, CORDON-MAS reduces attack success rate by 92.4\% relative to undefended RAG. This reframes RAG poisoning from a detection problem to an information-flow control problem.