🤖 AI Summary
This work addresses the training instability and model collapse that arise in tool-based agentic reinforcement learning from Importance Sampling Distribution Drift (ISDD). To mitigate this issue, the authors propose SAPO, a method that introduces a conditional token-level KL constraint within the Group Relative Policy Optimization (GRPO) framework, penalizing only low-probability positive tokens. This design enables stable policy updates through a minimal one-line code modification, curbing distributional drift while preserving informative gradient flow. Experimental results demonstrate that SAPO achieves an average absolute accuracy improvement of 10.6% (a relative gain of 31.5%) across seven question-answering benchmarks, with consistent gains across model scales (1.5B and 14B parameters) and model families, including Qwen and LLaMA.
📝 Abstract
Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to autonomously interact with external tools in multi-turn information-seeking processes. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose \textbf{S}earch \textbf{A}gent \textbf{P}olicy \textbf{O}ptimization (\textbf{SAPO}), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities, where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves \textbf{+10.6\% absolute improvement} (+31.5\% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
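The mechanism described above can be sketched as a per-token GRPO loss with a conditional KL term. This is a minimal illustrative sketch, not the paper's actual implementation: the names `prob_threshold` and `kl_coef`, the choice of per-token KL estimator, and the clipped-surrogate form are all assumptions on our part; only the overall idea (penalize KL between current and old policies solely on positive-advantage tokens whose current probability is low) comes from the abstract.

```python
import math

def sapo_token_loss(logp_new, logp_old, advantage,
                    clip_eps=0.2, kl_coef=0.1, prob_threshold=0.5):
    """Hypothetical per-token loss: standard clipped GRPO/PPO surrogate,
    plus SAPO-style conditional KL penalty. Hyperparameter names and
    values are illustrative assumptions, not the paper's notation."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    # Standard clipped surrogate (loss is the negative objective).
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    loss = -min(unclipped, clipped)
    # Conditional constraint: penalize drift from the old policy only on
    # positive tokens (advantage > 0) whose current probability is low.
    if advantage > 0 and math.exp(logp_new) < prob_threshold:
        # Simple sampled KL(pi_old || pi_new) estimate for this token.
        loss += kl_coef * (logp_old - logp_new)
    return loss
```

Note the gate fires only when both conditions hold, so tokens with negative advantage, or positive tokens the policy still assigns high probability, are trained with the plain clipped objective and keep their gradient flow untouched.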