Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional RAG systems suffer from inefficient exploration, gradient conflicts, and sparse rewards in multi-step reasoning due to static pipelines and coarse-grained, outcome-level reward signals. To address these issues, this paper proposes ReasonRAG, a reinforcement learning framework guided by fine-grained, process-level rewards. Its core contributions include: (i) automatically constructing RAG-ProGuide, a high-quality supervision dataset providing process-level rewards for three sequential stages: query generation, evidence extraction, and answer generation; (ii) process-supervised reinforcement learning that strengthens the model's inherent agentic capabilities; and (iii) process-level policy optimization that lets the LLM autonomously invoke search, generate queries, extract evidence, and produce final answers. Experiments across five benchmarks demonstrate that ReasonRAG significantly outperforms baselines such as Search-R1, achieving state-of-the-art performance with only 5k training samples versus Search-R1's 90k (a roughly 94% reduction in data requirements) while improving training stability, convergence speed, and sample efficiency.
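To make the sparse-versus-dense distinction concrete, here is a minimal sketch of how outcome-level and process-level rewards assign credit over one three-stage trajectory. This is not the paper's implementation; the stage names follow the summary above, but `Step`, the quality scores, and the reward values are illustrative assumptions.

```python
# Contrast outcome-level vs. process-level reward assignment over one
# hypothetical three-stage agentic RAG trajectory.
from dataclasses import dataclass

@dataclass
class Step:
    stage: str      # "query_generation" | "evidence_extraction" | "answer_generation"
    output: str
    quality: float  # assumed per-step quality score in [0, 1]

def outcome_reward(trajectory: list[Step], answer_correct: bool) -> list[float]:
    """Outcome supervision: every step receives 0 except the last, which
    gets a single sparse scalar for final-answer correctness."""
    rewards = [0.0] * len(trajectory)
    rewards[-1] = 1.0 if answer_correct else 0.0
    return rewards

def process_reward(trajectory: list[Step]) -> list[float]:
    """Process supervision: each intermediate step is scored individually,
    crediting good queries and good evidence even when the answer is wrong."""
    return [step.quality for step in trajectory]

trajectory = [
    Step("query_generation", "Who directed Oppenheimer?", quality=0.9),
    Step("evidence_extraction", "Christopher Nolan directed ...", quality=0.8),
    Step("answer_generation", "Christopher Nolan", quality=1.0),
]
print(outcome_reward(trajectory, answer_correct=True))  # [0.0, 0.0, 1.0]  (sparse)
print(process_reward(trajectory))                       # [0.9, 0.8, 1.0]  (dense)
```

The dense per-step signal is what the summary credits for faster convergence: intermediate stages receive gradient signal directly rather than only through a single end-of-trajectory scalar.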

📝 Abstract
Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multi-step reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient conflict, and sparse reward signals. To overcome these challenges, we propose to utilize fine-grained, process-level rewards to improve training stability, reduce computational costs, and enhance efficiency. Specifically, we introduce ReasonRAG, a novel method that automatically constructs RAG-ProGuide, a high-quality dataset providing process-level rewards for (i) query generation, (ii) evidence extraction, and (iii) answer generation, thereby enhancing the model's inherent capabilities via process-supervised reinforcement learning. With process-level policy optimization, the proposed framework empowers LLMs to autonomously invoke search, generate queries, extract relevant evidence, and produce final answers. Compared to existing approaches such as Search-R1 and traditional RAG systems, ReasonRAG, leveraging RAG-ProGuide, achieves superior performance on five benchmark datasets using only 5k training instances, significantly fewer than the 90k required by Search-R1.
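The abstract describes a policy that autonomously decides, step by step, whether to search again or answer. A minimal sketch of such an agent loop follows; `llm`, `search`, and the SEARCH/ANSWER action format are assumed stand-ins for illustration, not ReasonRAG's actual interface.

```python
# Hypothetical agentic RAG loop: at each step the policy either issues a
# new search query or commits to a final answer.
from typing import Callable

def agentic_rag(question: str,
                llm: Callable[[str], str],
                search: Callable[[str], list[str]],
                max_steps: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        prompt = (f"Question: {question}\n"
                  f"Evidence so far: {evidence}\n"
                  "Reply 'SEARCH: <query>' to retrieve more, or 'ANSWER: <answer>'.")
        action = llm(prompt)
        if action.startswith("SEARCH:"):
            query = action.removeprefix("SEARCH:").strip()  # query generation
            evidence.extend(search(query))                  # retrieval + evidence (simplified)
        elif action.startswith("ANSWER:"):
            return action.removeprefix("ANSWER:").strip()   # answer generation
    # Fallback: force an answer once the step budget is exhausted.
    return llm(f"Question: {question}\nEvidence: {evidence}\nGive the final answer.")
```

Under process supervision, each query-generation, evidence-extraction, and answer-generation step in this loop would receive its own reward rather than waiting for a single end-of-episode correctness check.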
Problem

Research questions and friction points this paper is trying to address.

Enhancing agentic RAG systems with dynamic retrieval strategies
Addressing low exploration efficiency in outcome-supervised RAG methods
Improving training stability via process-level reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process-supervised reinforcement learning for RAG (see the policy-gradient sketch after this list)
Dynamic retrieval and iterative context refinement
Autonomous search and evidence extraction
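To illustrate how per-step process rewards plug into policy optimization, here is a hedged REINFORCE-style sketch. Tensor shapes, names, and the return-to-go formulation are assumptions for illustration; the paper's actual process-level policy optimization may differ.

```python
# Hypothetical policy-gradient loss using per-step process rewards in place
# of a single trajectory-level outcome reward.
import torch

def process_supervised_pg_loss(step_logprobs: torch.Tensor,
                               process_rewards: torch.Tensor,
                               gamma: float = 1.0,
                               baseline: float = 0.0) -> torch.Tensor:
    """step_logprobs: (T,) summed log-probs of the tokens emitted at each step.
    process_rewards: (T,) per-step reward (e.g., query/evidence/answer quality).
    Returns a scalar loss whose gradient ascends expected process reward."""
    T = process_rewards.shape[0]
    # Discounted return-to-go: each step is credited with its own reward
    # plus the (discounted) rewards of everything it enabled downstream.
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = process_rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - baseline
    # Policy gradient: maximize sum_t log pi(a_t | s_t) * A_t.
    return -(step_logprobs * advantages.detach()).sum()
```

With outcome supervision, `process_rewards` would be zero everywhere except the final step, so early steps are credited only through the discounted tail; dense process rewards give every stage a direct learning signal.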