TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of credit assignment in reinforcement learning (RL)-based retrieval-augmented generation (RAG) agents, which typically rely solely on sparse final rewards and struggle to attribute credit to intermediate reasoning and retrieval steps. While existing process supervision approaches mitigate this issue, they often depend on offline data or costly human annotations and are vulnerable to distributional shift. To overcome these limitations, we propose an online tree-based process supervision framework that models the reasoning trajectory as an unfolding tree and leverages Monte Carlo estimates of descendant outcomes to enable fine-grained, step-level credit assignment using only final rewards, without requiring intermediate labels. Combined with efficient tree construction and advantage function optimization, our method achieves significant performance gains over current outcome- and process-supervised RL baselines across seven multi-hop and general question-answering benchmarks spanning diverse model scales, matching the inference cost of Search-R1.

📝 Abstract
Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
Problem

Research questions and friction points this paper is trying to address.

Agentic RAG
reinforcement learning
process supervision
credit assignment
outcome-based reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based Reinforcement Learning
Process Supervision
Agentic RAG
Step-wise Credit Assignment
Rollout Tree
Tianhua Zhang
The Chinese University of Hong Kong
Natural Language Processing, Large Language Models
Kun Li
The Chinese University of Hong Kong
Natural Language Processing, Large Language Models
Junan Li
The Chinese University of Hong Kong, Hong Kong SAR, China
Yunxiang Li
The Chinese University of Hong Kong, Hong Kong SAR, China
Hongyin Luo
MIT CSAIL
Artificial Intelligence, Machine Learning, Natural Language Processing
Xixin Wu
The Chinese University of Hong Kong
James R. Glass
Massachusetts Institute of Technology, Cambridge MA, USA
Helen Meng
The Chinese University of Hong Kong, Hong Kong SAR, China