Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

📅 2024-09-03
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing RAG methods predominantly rely on single knowledge sources, rendering them inadequate for real-world scenarios requiring integration of heterogeneous, cross-domain knowledge—spanning structured (e.g., databases, tables) and unstructured (e.g., text, images) modalities—while lacking standardized benchmarks and systematic evaluation. Method: We formally define the multi-source knowledge RAG task and introduce PruningRAG, a novel framework featuring a three-stage, multi-granularity knowledge filtering mechanism: semantic similarity pruning, domain-adaptive filtering, and hierarchical attention distillation—dynamically suppressing noise and redundancy. Contribution/Results: We release the first open-source benchmark encompassing multimodal, cross-domain, structured, and unstructured knowledge. On this benchmark, PruningRAG achieves an average 12.7% improvement in QA accuracy and a 31.4% reduction in hallucination rate over state-of-the-art RAG baselines.

Technology Category

Application Category

📝 Abstract
Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach to mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. While numerous efforts, most studies focus on a single type of external knowledge source. In contrast, most real-world applications involve diverse knowledge from various sources, a scenario that has been relatively underexplored. The main dilemma is the lack of a suitable dataset incorporating multiple knowledge sources and pre-exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Building upon the dataset, we identify the limitations of existing methods under such conditions. Therefore, we develop PruningRAG, a plug-and-play RAG framework that uses multi-granularity pruning strategies to more effectively incorporate relevant context and mitigate the negative impact of misleading information. Extensive experimental results demonstrate superior performance of PruningRAG and our insightful findings are also reported. Our dataset and code are publicly availablefootnote{https://github.com/USTCAGI/PruningRAG}.
Problem

Research questions and friction points this paper is trying to address.

Multi-source knowledge integration in RAG
Mitigating misinformation in language models
Developing a benchmark for diverse knowledge sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-source knowledge integration
Multi-granularity pruning strategies
Standardized benchmark dataset
🔎 Similar Papers
2024-06-01International Conference on Computational LinguisticsCitations: 4
S
Shuo Yu
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
M
Mingyue Cheng
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
J
Jiqian Yang
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
O
Ouyang Jie
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Y
Yucong Luo
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Chenyi Lei
Chenyi Lei
Kuaishou Technology
Recommender SystemInformation RetrievalGenerative RecommendationMultimodal
Q
Qi Liu
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Enhong Chen
Enhong Chen
University of Science and Technology of China
data miningrecommender systemmachine learning