🤖 AI Summary
To address factual inconsistency in retrieval-augmented generation (RAG) caused by noisy retrieved documents, this paper proposes a robust knowledge-utilization framework. First, it constructs structured knowledge representations that explicitly model factual logic, enabling fine-grained error detection during training. Second, it introduces a Dense Direct Preference Optimization (DDPO) objective that weights training toward correcting the most critical errors in generated content. Third, it devises a contrastive correction data-generation mechanism that preserves semantic consistency while rectifying factual errors, synthesizing high-quality training samples with minimal human annotation. Experiments across multiple large language models and model scales demonstrate substantial improvements in factual accuracy and cross-domain generalization. The framework establishes a practical paradigm for enhancing RAG robustness with modest training data, offering both theoretical insight and practical utility for reliable knowledge-grounded generation.
📝 Abstract
Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents, even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training, (2) Dense Direct Preference Optimization (DDPO), a refined training objective that prioritizes correction of critical errors, and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on GitHub.
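To make the "dense" idea behind DDPO concrete, the sketch below shows a token-weighted variant of the standard DPO loss. This is an illustration, not the paper's implementation: the function name `dense_dpo_loss` and the per-token weighting scheme are assumptions; only the underlying DPO formulation (maximizing the log-sigmoid of the scaled preference margin between chosen and rejected responses, relative to a frozen reference model) is standard. The intuition is that upweighting the tokens where the corrected and erroneous outputs actually differ focuses the preference signal on critical errors.

```python
import math

def dense_dpo_loss(pol_w, pol_l, ref_w, ref_l, wts_w, wts_l, beta=0.1):
    """Token-weighted (dense) DPO loss sketch.

    pol_* / ref_*: per-token log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model.
    wts_*: hypothetical per-token importance weights, e.g. larger on tokens
    where the corrected and erroneous outputs disagree. With all weights
    equal to 1 this reduces to vanilla sequence-level DPO.
    """
    # Weighted log-ratio margin of each response vs. the reference model.
    margin_w = sum(w * (p - r) for w, p, r in zip(wts_w, pol_w, ref_w))
    margin_l = sum(w * (p - r) for w, p, r in zip(wts_l, pol_l, ref_l))
    # DPO objective: maximize log sigmoid(beta * (margin_w - margin_l)),
    # i.e. minimize its negation.
    z = beta * (margin_w - margin_l)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# Uniform weights: chosen response is preferred, so loss < log(2).
base = dense_dpo_loss([-1.0], [-1.0], [-1.5], [-0.5], [1.0], [1.0])
# Upweighting the differing token widens the margin and lowers the loss.
dense = dense_dpo_loss([-1.0], [-1.0], [-1.5], [-0.5], [2.0], [2.0])
```

Here `base` corresponds to ordinary DPO on a single-token pair, and `dense` shows how amplifying the weight on an error-bearing token strengthens the preference gradient toward the correction.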