🤖 AI Summary
In open-domain question answering, conventional retrieval-augmented generation (RAG) suffers from a low signal-to-noise ratio in retrieved evidence and from error accumulation in multi-hop reasoning. This paper proposes EviNote-RAG, an end-to-end agentic framework that first retrieves candidate passages, then distills key information and explicitly annotates uncertainty via Supportive-Evidence Notes (SENs), and finally generates answers. Its core contribution is the Evidence Quality Reward (EQR), a dense, interpretable reinforcement learning signal grounded in logical entailment, which significantly improves training stability and answer faithfulness. By unifying retrieval-augmented generation, evidence distillation, and entailment judgment, EviNote-RAG achieves substantial relative F1 gains: +20% on HotpotQA, +40% on Bamboogle, and +91% on 2Wiki, markedly enhancing model generalization, robustness, and response efficiency.
📝 Abstract
Large Language Models (LLMs) empowered with retrieval mechanisms have achieved strong progress in open-domain question answering (QA). Yet, the conventional retrieve-then-answer paradigm often suffers from two key limitations: (1) low signal-to-noise ratio in retrieved evidence, where useful information is buried under irrelevant content, and (2) error accumulation in multi-hop reasoning when incomplete or noisy passages are involved. To address these challenges, we present EviNote-RAG, an agentic RAG framework that introduces a structured retrieve-note-answer pipeline. Instead of directly reasoning over raw retrievals, the model is trained to compose Supportive-Evidence Notes (SENs): concise, human-like notes that preserve only answer-relevant information, highlight uncertainty, and explicitly state when no useful evidence exists. This distillation process is further reinforced by the Evidence Quality Reward (EQR), an entailment-based signal that evaluates whether SENs logically support the final answer. Together, SENs and EQR guide the model toward faithful and robust reasoning, while reducing the impact of noise. Experiments on in-domain and out-of-domain QA benchmarks show that EviNote-RAG consistently outperforms strong baselines in accuracy, generalization, and training stability. In particular, it achieves state-of-the-art results while enhancing robustness and efficiency, yielding relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256) via denser rewards and reduced verbosity.
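To make the reward design concrete, the entailment-based EQR described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 0.5 bonus weight, and the token-overlap `entails` stand-in (which a real system would replace with an NLI/entailment model) are all assumptions for demonstration.

```python
def entails(note: str, answer: str) -> bool:
    """Stand-in entailment judge (assumption: real EQR would query an
    NLI model). Here the note 'entails' the answer if every answer
    token appears in the note."""
    note_tokens = set(note.lower().split())
    return all(tok in note_tokens for tok in answer.lower().split())


def evidence_quality_reward(note: str, answer: str, answer_correct: bool) -> float:
    """Hypothetical combined training signal: outcome reward for a
    correct answer, plus a dense entailment bonus when the
    Supportive-Evidence Note logically supports that answer."""
    base = 1.0 if answer_correct else 0.0
    eqr_bonus = 0.5 if entails(note, answer) else 0.0  # dense shaping term
    return base + eqr_bonus
```

In this sketch, a trajectory whose note supports a correct answer earns more than one whose note is uninformative, giving the dense, interpretable signal the abstract attributes to EQR.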