Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a core difficulty in retrieval-augmented reasoning with large language models (LLMs): sparse, final-answer-only rewards make it hard to credit individual steps in multi-step reasoning and retrieval. To overcome this, the authors propose SLATE, a framework that uses truncated step-level sampling to generate trajectories that share a common prefix but differ only in the next action. SLATE introduces dense process rewards derived from a stronger LLM judge, which evaluates each reasoning step, search query, and intermediate answer, yielding more precise policy-gradient signals. Theoretical analysis shows that this approach can reduce the variance of advantage estimation by up to a factor of T for T-step trajectories. Empirical results demonstrate consistent improvements over sparse-reward and existing process-reward baselines across seven question-answering benchmarks, with particularly notable gains on multi-hop tasks and smaller models.
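The sampling idea in the summary can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's exact formulation: `judge_score` is a deterministic stand-in for the LLM-as-judge call, and the group-mean baseline is one plausible way to turn dense step rewards into advantages.

```python
import random

def judge_score(prefix, step):
    # Stand-in for an LLM-as-judge call (hypothetical): a real system would
    # prompt a stronger LLM to rate the step's quality in [0, 1]. Here we
    # use a trivial deterministic heuristic so the sketch is runnable.
    return min(1.0, len(step) / 40.0)

def truncated_step_advantages(prefix, policy_sample, k=4):
    """Sample k candidate next steps from one shared prefix and compute
    group-baselined advantages from dense per-step judge rewards."""
    steps = [policy_sample(prefix) for _ in range(k)]
    rewards = [judge_score(prefix, s) for s in steps]
    baseline = sum(rewards) / k  # group mean as the advantage baseline
    return [(s, r - baseline) for s, r in zip(steps, rewards)]

# Toy policy: emits one of a few canned next actions.
actions = ["<think>...</think>", "<search>query A</search>", "<answer>B</answer>"]
pairs = truncated_step_advantages("Q: ...\n", lambda p: random.choice(actions), k=4)
```

Because only the final step differs across the k samples, each advantage isolates the effect of that one decision rather than an entire trajectory.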

📝 Abstract
Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
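The factor-of-T claim has a simple intuition that a toy Monte Carlo simulation can illustrate. Under the (assumed, not from the paper) model that each step contributes i.i.d. reward noise, a full-trajectory return aggregates noise from all T steps, while truncated sampling holds the prefix fixed so only the one resampled step varies:

```python
import random
import statistics

T = 10        # trajectory length (illustrative)
SIGMA = 1.0   # per-step reward noise, std dev
N = 20000     # Monte Carlo samples

# Full-trajectory sampling: the return sums noise from all T steps,
# so its variance is roughly T * SIGMA**2.
full_returns = [sum(random.gauss(0.0, SIGMA) for _ in range(T))
                for _ in range(N)]

# Truncated step-level sampling: trajectories share a common prefix,
# so only the single resampled step contributes noise (variance ~ SIGMA**2).
truncated_returns = [random.gauss(0.0, SIGMA) for _ in range(N)]

var_full = statistics.variance(full_returns)
var_trunc = statistics.variance(truncated_returns)
ratio = var_full / var_trunc  # empirically close to T
```

This is only the best-case intuition; the paper's theorem states the reduction as "up to a factor of T", and correlated step rewards would shrink the gap.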
Problem

Research questions and friction points this paper is trying to address.

credit assignment
reinforcement learning
retrieval-augmented reasoning
process rewards
gradient variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

truncated step-level sampling
process rewards
LLM-as-judge
reinforcement learning
retrieval-augmented reasoning