DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle to proactively retrieve external knowledge for complex tasks, and applying reinforcement learning (RL) to multi-step interactive reasoning commonly suffers from instability such as performance saturation and training collapse. To address these limitations, this paper proposes Dynamic-filter Sequence-level Policy Optimization (DSPO), a fully end-to-end, sequence-level RL algorithm that jointly optimizes multi-turn search and logical reasoning without requiring supervised demonstrations. It introduces a dynamic sample filtering mechanism to improve training stability and convergence. To the authors' knowledge, DSPO is the first method to jointly optimize multi-step knowledge retrieval and logical inference under pure RL supervision. On multi-hop question answering benchmarks including HotpotQA, a 7B-parameter model trained with DSPO achieves a 34.1% absolute improvement over prior state-of-the-art methods and outperforms a 14B-parameter baseline by nearly 9%, significantly advancing the capabilities of small-scale models in agent-based question answering.

📝 Abstract
Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce **D**ynamic-filter **S**equence-level **P**olicy **O**ptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over comparable previous work by **34.1%**, and even outperforms the 14B model from previous work on complex multi-hop QA such as HotpotQA by nearly **9% relative**, while maintaining exceptional training stability.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs with active external knowledge search
Overcoming performance ceilings in RL for complex tasks
Enabling stable multi-turn search and reasoning without supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sample filtering for stable RL training
Sequence-level policy optimization for agentic search
Pure RL training without supervised demonstration data
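The two algorithmic ideas above can be illustrated together. The paper does not publish pseudocode here, so the following is a minimal sketch under assumptions: a GRPO-style setup where rollouts of the same prompt form a group, advantages are group-normalized, the importance ratio is computed once per whole trajectory (sequence-level rather than token-level), and dynamic sample filtering drops groups whose rollouts all receive the same reward, since they contribute zero advantage and only gradient noise. The function name and all details are hypothetical.

```python
import math
from collections import defaultdict

def dspo_loss_sketch(logprobs, old_logprobs, rewards, group_ids, clip_eps=0.2):
    """Hypothetical DSPO-style surrogate loss (illustrative assumptions only).

    logprobs / old_logprobs: summed log-probability of each whole rollout
    rewards: scalar outcome reward per rollout
    group_ids: rollouts sampled from the same prompt share an id
    """
    groups = defaultdict(list)
    for i, g in enumerate(group_ids):
        groups[g].append(i)

    terms = []
    for idxs in groups.values():
        rs = [rewards[i] for i in idxs]
        mean = sum(rs) / len(rs)
        std = math.sqrt(sum((r - mean) ** 2 for r in rs) / len(rs))
        # Dynamic sample filtering: if every rollout in the group got the
        # same reward (all correct or all wrong), the normalized advantage
        # is zero everywhere, so the group is dropped from the batch.
        if std < 1e-8:
            continue
        for i in idxs:
            adv = (rewards[i] - mean) / (std + 1e-8)
            # Sequence-level ratio: one importance weight per trajectory,
            # not per token.
            ratio = math.exp(logprobs[i] - old_logprobs[i])
            clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
            # PPO-style clipped surrogate, negated for minimization.
            terms.append(-min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) if terms else 0.0
```

A batch where one prompt's rollouts split into successes and failures keeps its group, while a prompt whose rollouts all succeed is filtered out entirely, which is one plausible reading of how filtering stabilizes training.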