AI Summary
This work addresses the limitations of traditional reinforcement learning-based Agentic RAG approaches, which suffer from sparse rewards and low sample efficiency, hindering effective utilization of intermediate signals during reasoning. To overcome these challenges, the authors propose Search-P1, a novel framework that introduces a path-centric reward shaping mechanism. This mechanism integrates order-agnostic step coverage, a soft scoring strategy, and an offline-generated reference planner. Furthermore, Search-P1 employs a dual-track path scoring method to extract informative learning signals from failed trajectories, jointly evaluating reasoning path quality through both self-consistency and reference alignment perspectives. Experimental results across multiple question-answering benchmarks demonstrate that Search-P1 achieves an average accuracy improvement of 7.7 percentage points, significantly outperforming strong baselines such as Search-R1.
Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
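The path-centric reward described above can be illustrated with a minimal sketch. This is an assumption-laden toy implementation, not the paper's actual method: the function names (`soft_step_score`, `path_coverage_reward`, `dual_track_score`), the token-overlap F1 used for soft matching, and the linear mixing weight `alpha` are all illustrative choices standing in for whatever the authors use.

```python
# Hypothetical sketch of a path-centric reward in the spirit of Search-P1:
# order-agnostic step coverage, soft scoring, and dual-track combination
# against an offline reference plan. All names and scoring choices here are
# illustrative assumptions, not the paper's implementation.

def soft_step_score(step: str, ref_step: str) -> float:
    """Soft match between a trajectory step and a reference step,
    here approximated by token-overlap F1 (an assumed stand-in)."""
    a, b = set(step.lower().split()), set(ref_step.lower().split())
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def path_coverage_reward(trajectory: list[str], reference_plan: list[str]) -> float:
    """Order-agnostic coverage: each reference step is credited by its best
    soft match anywhere in the trajectory, so even a failed trajectory that
    got some steps right still earns a graded (non-zero) reward."""
    if not reference_plan:
        return 0.0
    total = 0.0
    for ref in reference_plan:
        # max over trajectory positions makes the score order-agnostic
        total += max((soft_step_score(s, ref) for s in trajectory), default=0.0)
    return total / len(reference_plan)

def dual_track_score(trajectory: list[str], reference_plan: list[str],
                     self_consistency: float, alpha: float = 0.5) -> float:
    """Combine a self-consistency signal (e.g., agreement across sampled
    rollouts, assumed to be in [0, 1]) with reference alignment."""
    return alpha * self_consistency + (1 - alpha) * path_coverage_reward(trajectory, reference_plan)
```

Under this sketch, a trajectory that covers only some reference steps out of order still receives partial credit, which is the mechanism the abstract credits for extracting learning signal from failed samples.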