Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional reinforcement learning-based Agentic RAG approaches, which suffer from sparse rewards and low sample efficiency, hindering effective use of intermediate signals during reasoning. To overcome these challenges, the authors propose Search-P1, a framework built around a path-centric reward shaping mechanism. This mechanism integrates order-agnostic step coverage, a soft scoring strategy, and an offline-generated reference planner. Search-P1 further employs a dual-track path scoring method that extracts informative learning signals from failed trajectories, jointly evaluating reasoning-path quality from both self-consistency and reference-alignment perspectives. Experiments across multiple question-answering benchmarks show that Search-P1 achieves an average accuracy improvement of 7.7 percentage points, significantly outperforming strong baselines such as Search-R1.

๐Ÿ“ Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
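The paper's exact reward formulas are not reproduced on this page, so the following is a minimal illustrative sketch of the core idea only: blending a binary outcome reward with order-agnostic, softly scored coverage of a reference plan, so that failed trajectories still receive a partial learning signal. The names `path_reward` and `step_similarity`, the Jaccard token-overlap similarity, and the blending weight `alpha` are all assumptions for illustration, not the paper's actual implementation.

```python
def step_similarity(a: str, b: str) -> float:
    """Toy token-overlap (Jaccard) similarity between two reasoning steps.

    Stand-in for whatever soft matching the paper actually uses.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def path_reward(pred_steps, ref_steps, outcome_correct, alpha=0.5):
    """Blend a sparse outcome reward with order-agnostic soft step coverage.

    Each step of the offline-generated reference plan is credited by its
    best match among the predicted steps, regardless of position, so a
    trajectory that answers incorrectly but covers part of the reference
    plan still earns a nonzero shaped reward.
    """
    if not ref_steps:
        return float(outcome_correct)
    # Order-agnostic coverage: best match per reference step, averaged.
    coverage = sum(
        max(step_similarity(r, p) for p in pred_steps) for r in ref_steps
    ) / len(ref_steps)
    # Soft scoring: partial credit even when outcome_correct is False.
    return alpha * float(outcome_correct) + (1.0 - alpha) * coverage
```

Under this sketch, a failed trajectory that overlaps with part of the reference plan gets a reward strictly between 0 and `alpha`-less-than-1, rather than the flat zero of a pure outcome reward.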
Problem

Research questions and friction points this paper is trying to address.

Agentic RAG
reward sparsity
sample efficiency
reasoning trajectories
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Path-Centric Reward
Reward Shaping
Agentic RAG
Dual-Track Scoring
Reasoning Trajectory
👥 Authors
Tianle Xia (Tencent)
Ming Xu (Tencent)
Lingxiang Hu (Tencent)
Yiding Sun (Renmin University of China)
Wenwei Li (Tencent)
Linfang Shang (Tencent)
Liqun Liu (Tencent)
Peng Shu (Tencent)
Huan Yu (Tencent)
Jie Jiang (Tencent)