Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional reinforcement learning-based Agentic RAG approaches, which suffer from sparse rewards and low sample efficiency, hindering effective use of intermediate signals during reasoning. To overcome these challenges, the authors propose Search-P1, a framework built around a path-centric reward shaping mechanism. This mechanism integrates order-agnostic step coverage, a soft scoring strategy, and an offline-generated reference planner. Search-P1 further employs a dual-track path scoring method that extracts informative learning signals from failed trajectories, jointly evaluating reasoning-path quality from both self-consistency and reference-alignment perspectives. Experiments across multiple question-answering benchmarks show that Search-P1 achieves an average accuracy improvement of 7.7 percentage points, significantly outperforming strong baselines such as Search-R1.

๐Ÿ“ Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
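The paper's exact reward formulas are not reproduced on this page, so the following is a minimal illustrative sketch of the core idea only: blending a binary outcome reward with order-agnostic, softly scored coverage of a reference plan, so that failed trajectories still receive a partial learning signal. The names `path_reward` and `step_similarity`, the Jaccard token-overlap similarity, and the blending weight `alpha` are all assumptions for illustration, not the paper's actual implementation.

```python
def step_similarity(a: str, b: str) -> float:
    """Toy token-overlap (Jaccard) similarity between two reasoning steps.

    Stand-in for whatever soft matching the paper actually uses.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def path_reward(pred_steps, ref_steps, outcome_correct, alpha=0.5):
    """Blend a sparse outcome reward with order-agnostic soft step coverage.

    Each step of the offline-generated reference plan is credited by its
    best match among the predicted steps, regardless of position, so a
    trajectory that answers incorrectly but covers part of the reference
    plan still earns a nonzero shaped reward.
    """
    if not ref_steps:
        return float(outcome_correct)
    # Order-agnostic coverage: best match per reference step, averaged.
    coverage = sum(
        max(step_similarity(r, p) for p in pred_steps) for r in ref_steps
    ) / len(ref_steps)
    # Soft scoring: partial credit even when outcome_correct is False.
    return alpha * float(outcome_correct) + (1.0 - alpha) * coverage
```

Under this sketch, a failed trajectory that overlaps with part of the reference plan gets a reward strictly between 0 and `alpha`-less-than-1, rather than the flat zero of a pure outcome reward.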
Problem

Research questions and friction points this paper is trying to address.

Agentic RAG
reward sparsity
sample efficiency
reasoning trajectories
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Path-Centric Reward
Reward Shaping
Agentic RAG
Dual-Track Scoring
Reasoning Trajectory
👥 Authors
Tianle Xia (Tencent)
Ming Xu (Tencent)
Lingxiang Hu (Tencent)
Yiding Sun (Renmin University of China)
Wenwei Li (Tencent)
Linfang Shang (Tencent)
Liqun Liu (Tencent)
Peng Shu (Tencent)
Huan Yu (Tencent)
Jie Jiang (Tencent)