Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation

📅 2025-09-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

RAG systems are vulnerable to knowledge base poisoning attacks, and existing defenses are often circumvented by more adaptive attacks. This paper proposes RAGOrigin, a black-box responsibility attribution framework that identifies which texts in the knowledge database are responsible for an incorrect generation. For each misgeneration event, it builds a focused attribution scope and scores every candidate text by its retrieval ranking, its semantic relevance, and its influence on the generated response; it then isolates poisoned texts with unsupervised clustering. Across seven datasets and fifteen poisoning attacks, including adaptive and multi-attacker settings, the method outperforms existing baselines at identifying poisoned knowledge entries and remains robust under dynamic and noisy conditions.

📝 Abstract

Retrieval-Augmented Generation (RAG) integrates external knowledge into large language models to improve response quality. However, recent work has shown that RAG systems are highly vulnerable to poisoning attacks, where malicious texts are inserted into the knowledge database to influence model outputs. While several defenses have been proposed, they are often circumvented by more adaptive or sophisticated attacks. This paper presents RAGOrigin, a black-box responsibility attribution framework designed to identify which texts in the knowledge database are responsible for misleading or incorrect generations. Our method constructs a focused attribution scope tailored to each misgeneration event and assigns a responsibility score to each candidate text by evaluating its retrieval ranking, semantic relevance, and influence on the generated response. The system then isolates poisoned texts using an unsupervised clustering method. We evaluate RAGOrigin across seven datasets and fifteen poisoning attacks, including newly developed adaptive poisoning strategies and multi-attacker scenarios. Our approach outperforms existing baselines in identifying poisoned content and remains robust under dynamic and noisy conditions. These results suggest that RAGOrigin provides a practical and effective solution for tracing the origins of corrupted knowledge in RAG systems.
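The abstract names three signals behind each responsibility score (retrieval ranking, semantic relevance, and influence on the generated response) but does not give the formula that combines them. The sketch below is one plausible combination, not RAGOrigin's actual scoring rule: it assumes each signal is already available as a float per candidate text, standardises each signal, and averages the three. The function names and the equal weighting are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise a list of floats; empty or constant input maps to zeros."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def responsibility_scores(retrieval_sim, semantic_sim, gen_influence):
    """Combine three per-text signals into one score per candidate text.

    ASSUMPTION: equal-weight average of z-scored signals. The source only
    says that the three signals are evaluated, not how they are merged.
    """
    if not (len(retrieval_sim) == len(semantic_sim) == len(gen_influence)):
        raise ValueError("all signal lists must have the same length")
    columns = [
        zscores(retrieval_sim),
        zscores(semantic_sim),
        zscores(gen_influence),
    ]
    # zip(*columns) yields one (retrieval, semantic, influence) triple per text.
    return [mean(triple) for triple in zip(*columns)]


# Toy values: the first text dominates on all three signals.
scores = responsibility_scores(
    retrieval_sim=[0.9, 0.2, 0.1],
    semantic_sim=[0.8, 0.3, 0.2],
    gen_influence=[0.95, 0.1, 0.05],
)
```

Standardising first keeps any one signal from dominating just because it lives on a larger numeric scale; a weighted average would be an equally reasonable choice under different assumptions.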

Problem

Research questions and friction points this paper is trying to address.

Identifying poisoned texts causing incorrect outputs in RAG systems

Attributing responsibility for misleading generations to database sources

Detecting malicious knowledge injections through unsupervised clustering analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box responsibility attribution framework

Unsupervised clustering isolates poisoned texts

Scores each candidate text by retrieval ranking, semantic relevance, and influence on the generated response
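The source does not say which unsupervised clustering method isolates the poisoned texts. As an illustration only, the sketch below swaps in a simple one-dimensional 2-means split over responsibility scores and flags the higher-scoring cluster. The function name `split_two_clusters` and the choice of 2-means are assumptions, not the paper's procedure.

```python
def split_two_clusters(scores, max_iter=100):
    """Return indices of the higher-scoring cluster from a 1-D 2-means split.

    ASSUMPTION: poisoned texts form the high-score cluster. If all scores
    are identical there is nothing to separate, so nothing is flagged.
    """
    if len(scores) < 2:
        return []
    lo, hi = min(scores), max(scores)  # initial centroids
    if lo == hi:
        return []
    for _ in range(max_iter):
        # Assign each score to its nearest centroid (ties go to the low cluster).
        high = [s for s in scores if abs(s - hi) < abs(s - lo)]
        low = [s for s in scores if abs(s - hi) >= abs(s - lo)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return [i for i, s in enumerate(scores) if abs(s - hi) < abs(s - lo)]


# Toy scores: two candidates stand well apart from the rest.
flagged = split_two_clusters([0.1, 0.2, 0.15, 2.0, 2.2])
```

Because the split needs no labelled poisoned examples and no fixed threshold, it adapts to however many texts an attacker injected, which is the appeal of an unsupervised step here.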

🔎 Similar Papers

ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence