TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of credit assignment in reinforcement learning (RL)-based retrieval-augmented generation (RAG) agents, which typically rely solely on sparse final rewards and struggle to attribute credit to intermediate reasoning and retrieval steps. While existing process supervision approaches mitigate this issue, they often depend on offline data or costly human annotations and are vulnerable to distributional shift. To overcome these limitations, we propose an online tree-based process supervision framework that models the reasoning trajectory as an unfolding tree and leverages Monte Carlo estimates of descendant outcomes to enable fine-grained, step-level credit assignment using only final rewards, without requiring intermediate labels. Combined with efficient tree construction and advantage function optimization, our method achieves significant performance gains over current outcome- and process-supervised RL baselines across seven multi-hop and general question-answering benchmarks spanning diverse model scales, matching the inference cost of Search-R1.

📝 Abstract
Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
Problem

Research questions and friction points this paper is trying to address.

Agentic RAG
reinforcement learning
process supervision
credit assignment
outcome-based reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based Reinforcement Learning
Process Supervision
Agentic RAG
Step-wise Credit Assignment
Rollout Tree
Tianhua Zhang
The Chinese University of Hong Kong
Natural Language Processing, Large Language Models
Kun Li
The Chinese University of Hong Kong
Natural Language Processing, Large Language Models
Junan Li
The Chinese University of Hong Kong, Hong Kong SAR, China
Yunxiang Li
The Chinese University of Hong Kong, Hong Kong SAR, China
Hongyin Luo
MIT CSAIL
Artificial Intelligence, Machine Learning, Natural Language Processing
Xixin Wu
The Chinese University of Hong Kong
James R. Glass
Massachusetts Institute of Technology, Cambridge MA, USA
Helen Meng
The Chinese University of Hong Kong, Hong Kong SAR, China