🤖 AI Summary
This work addresses a key limitation of existing reinforcement learning approaches for training agentic RAG systems: they sample all trajectories uniformly, overlooking how the density of retrieval supervision varies with search depth. To remedy this, the authors propose CuSearch, a framework that uses per-trajectory search depth as an annotation-free proxy for supervision density. CuSearch introduces a curriculum rollout sampling strategy, implemented via the Search-Depth Greedy Allocation (SDGA) operator with SDGA-Auto and SDGA-Phase variants, that prioritizes deeper-search trajectories under a fixed update budget and adapts as the depth distribution shifts during training. Applied to RLVR-based agentic RAG training, the method improves performance across model types and retrieval frameworks, with gains of up to 11.8 exact-match points over standard GRPO on ZeroSearch.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth: deeper-search trajectories contain more retrieval decision points and therefore provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, with gains of up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at https://github.com/MrToser/CuSearch.
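To make the idea of depth-prioritized selection under a fixed budget concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`sdga_auto`, `sdga_phase`, `advance_threshold`), the tie-breaking by rollout order, and the `min_count` rule for deciding when deeper trajectories are "sufficiently abundant" are all assumptions made for illustration. The abstract only states that SDGA-Auto targets the deepest trajectories in the batch and that SDGA-Phase advances a threshold over training.

```python
from typing import List, Sequence


def sdga_auto(depths: Sequence[int], budget: int) -> List[int]:
    """Return indices of the `budget` deepest rollouts in the batch.

    Assumption: ties are broken by original rollout order.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # sorted() is stable, so rollouts with equal depth keep their order.
    order = sorted(range(len(depths)), key=lambda i: -depths[i])
    return order[:budget]


def advance_threshold(depths: Sequence[int], threshold: int, min_count: int) -> int:
    """Raise the curriculum threshold by one if enough deeper rollouts exist.

    Assumption: "sufficiently abundant" is modeled as at least `min_count`
    rollouts in the batch having depth >= threshold + 1.
    """
    deeper = sum(1 for d in depths if d >= threshold + 1)
    return threshold + 1 if deeper >= min_count else threshold


def sdga_phase(depths: Sequence[int], budget: int, threshold: int) -> List[int]:
    """Fill the budget with rollouts at or above `threshold` first.

    Assumption: any remaining budget is filled with shallower rollouts,
    and rollout order is preserved within each group.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # False sorts before True, so rollouts meeting the threshold come first.
    order = sorted(range(len(depths)), key=lambda i: (depths[i] < threshold, i))
    return order[:budget]


if __name__ == "__main__":
    # Hypothetical search depths for one batch of six rollouts.
    batch_depths = [1, 3, 2, 3, 0, 2]
    print(sdga_auto(batch_depths, budget=3))
    print(sdga_phase(batch_depths, budget=3, threshold=2))
    print(advance_threshold(batch_depths, threshold=2, min_count=2))
```

In this sketch the selected indices would then determine which rollouts contribute to the policy update (for example, the GRPO loss), while the update budget stays fixed regardless of how the depth distribution shifts.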