Improving Search Agent with One Line of Code

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the training instability and model collapse in tool-augmented reinforcement learning caused by Importance Sampling Distribution Drift (ISDD). To mitigate this issue, the authors propose SAPO, which introduces a conditional token-level KL constraint within the Group Relative Policy Optimization (GRPO) framework, penalizing only low-probability positive tokens. This design enables stable policy updates through a minimal single-line code modification, curbing distributional drift while preserving informative gradient flow. Experimental results show that SAPO achieves an average absolute accuracy improvement of 10.6% (a relative gain of 31.5%) across seven question-answering benchmarks, with consistent gains across model scales (1.5B and 14B parameters) and architectures, including the Qwen and LLaMA families.

📝 Abstract
Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to autonomously interact with external tools in multi-turn information seeking. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities, where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across model scales (1.5B, 14B) and families (Qwen, LLaMA).
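The abstract's "one-line modification" idea can be sketched as a conditional KL term added to the standard GRPO token loss. This is a minimal illustration, not the paper's exact formulation: the probability threshold `p_thresh`, the choice of KL estimator, and all function names here are assumptions for the sake of the example.

```python
import numpy as np

def grpo_token_loss(logp_new, logp_old, advantages,
                    clip_eps=0.2, kl_coef=0.1, p_thresh=0.5):
    """GRPO per-token loss with a SAPO-style conditional KL penalty (sketch).

    logp_new, logp_old: log-probs of sampled tokens under the current and
    old policies; advantages: group-relative advantages per token.
    """
    ratio = np.exp(logp_new - logp_old)
    # Standard PPO/GRPO clipped surrogate (maximized, so negated as a loss).
    surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    loss = -surrogate
    # The conditional "one-line" addition: penalize divergence only on
    # positive-advantage tokens whose current probability has dropped low,
    # i.e. where the policy has drifted too far from the old policy.
    # Nonnegative k3-style estimator of per-token KL(new || old) -- an
    # illustrative choice, not necessarily the paper's.
    r = logp_old - logp_new
    kl = np.exp(r) - r - 1.0
    drifted = (advantages > 0) & (np.exp(logp_new) < p_thresh)
    loss = loss + kl_coef * drifted * kl
    return loss.mean()
```

Unlike hard clipping, which zeroes the gradient outside the trust region, the masked KL term keeps a restoring gradient on exactly the tokens where collapse would otherwise begin.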
Problem

Research questions and friction points this paper is trying to address.

training instability
importance sampling distribution drift
catastrophic model collapse
tool-based agentic reinforcement learning
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Importance Sampling Distribution Drift
Conditional KL Constraint
Tool-based Agentic Reinforcement Learning
Policy Optimization
Search Agent
👥 Authors
Jian Li (Nanjing University)
Dongsheng Chen (Technical University of Munich)
Zhenhua Xu (Tencent Youtu Lab)
Yizhang Jin (Tencent Youtu Lab)
Jiafu Wu (Tencent Youtu Lab)
Chengjie Wang (Tencent Youtu Lab)
Xiaotong Yuan (Nanjing University)
Yabiao Wang (Tencent Youtu Lab)