🤖 AI Summary
This work addresses two challenges in agentic search training: low utilization of expensive long-horizon trajectories and sparse, answer-level rewards. To overcome these issues, the authors propose a unified learning framework that jointly optimizes the search policy and prefix-based answer evaluation within a single model, without requiring additional annotations or a separate reward model. By extracting prefix states from search trajectories and eliciting intermediate answers from them, the method both constructs augmented training samples and derives step-level rewards from performance differences across prefixes, enabling finer-grained credit assignment and improved data utilization. The approach combines prefix-based trajectory reuse with a shared model for policy learning and answer evaluation, and applies reinforcement learning directly to multi-hop question answering. Extensive experiments show that the proposed method consistently outperforms strong baselines across multiple benchmarks, confirming its effectiveness and generalization capability.
📄 Abstract
In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
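The core reward mechanism described above can be sketched in a few lines: score the intermediate answer elicited from each prefix of a trajectory, then take the score gain between consecutive prefixes as the step-level reward. This is a minimal illustration only; the helper names (`answer_fn`, `reward_fn`) are hypothetical stand-ins for the model's answer elicitation and the task metric (e.g. exact match), not the paper's actual implementation.

```python
def prefix_scores(trajectory, answer_fn, reward_fn):
    """Score the intermediate answer elicited from each prefix of a trajectory.

    trajectory: list of search turns (states)
    answer_fn:  maps a prefix of turns to an intermediate answer (assumed helper)
    reward_fn:  scores an answer against the gold answer, e.g. exact match (assumed helper)
    """
    return [reward_fn(answer_fn(trajectory[: t + 1])) for t in range(len(trajectory))]


def step_rewards(scores):
    """Step-level reward at turn t = score gain over the previous prefix."""
    return [scores[0]] + [scores[t] - scores[t - 1] for t in range(1, len(scores))]
```

For example, if the prefix answers score `[0.0, 0.5, 1.0]`, the second and third search turns each receive a step reward of `0.5`, crediting the turns whose retrieved evidence actually improved the answer.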