🤖 AI Summary
This work addresses two challenges in agentic search training: low utilization of expensive long-horizon trajectories and sparse, answer-level rewards. To overcome these issues, the authors propose a unified learning framework that jointly optimizes the search policy and prefix-based answer evaluation within a single model, without requiring additional annotations or a separate reward model. By extracting prefix states from search trajectories and eliciting intermediate answers from them, the method both constructs augmented training samples and derives step-level rewards from performance differences across prefixes, enabling finer-grained credit assignment and improved data utilization. The approach combines prefix-based trajectory reuse with a shared model for policy learning and answer evaluation, and applies reinforcement learning directly to multi-hop question answering. Extensive experiments show that the proposed method consistently outperforms strong baselines across multiple benchmarks, confirming its effectiveness and generalization capability.
📄 Abstract
In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
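The core reward mechanism described above can be sketched in a few lines: score the intermediate answer elicited from each prefix of a trajectory, then take the score gain between consecutive prefixes as the step-level reward. This is a minimal illustration only; the helper names (`answer_fn`, `reward_fn`) are hypothetical stand-ins for the model's answer elicitation and the task metric (e.g. exact match), not the paper's actual implementation.

```python
def prefix_scores(trajectory, answer_fn, reward_fn):
    """Score the intermediate answer elicited from each prefix of a trajectory.

    trajectory: list of search turns (states)
    answer_fn:  maps a prefix of turns to an intermediate answer (assumed helper)
    reward_fn:  scores an answer against the gold answer, e.g. exact match (assumed helper)
    """
    return [reward_fn(answer_fn(trajectory[: t + 1])) for t in range(len(trajectory))]


def step_rewards(scores):
    """Step-level reward at turn t = score gain over the previous prefix."""
    return [scores[0]] + [scores[t] - scores[t - 1] for t in range(1, len(scores))]
```

For example, if the prefix answers score `[0.0, 0.5, 1.0]`, the second and third search turns each receive a step reward of `0.5`, crediting the turns whose retrieved evidence actually improved the answer.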