🤖 AI Summary
In RL-based LLM tool learning, the pedagogical value of simple samples diminishes over time, and existing dynamic sampling methods struggle to accommodate multi-task architectures and fine-grained reward signals. To address these challenges, we propose a synergistic framework integrating dual dynamic sampling and curriculum learning. Specifically, we jointly model sampling dynamics along two orthogonal dimensions: (i) reward-aware sampling, which dynamically weights trajectories based on per-step reward mean and variance; and (ii) task-aware curriculum progression, which sequences subtasks according to mastery estimates. To our knowledge, this is the first work to tightly integrate curriculum learning with multi-dimensional dynamic sampling in tool-use RL while explicitly coupling both to fine-grained reward modeling. Evaluated on the BFCLv3 benchmark, our method achieves a 3.29% absolute performance gain over strong baselines, with concurrent improvements in training efficiency and cross-task generalization.
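The reward-aware dimension can be illustrated with a short sketch. The paper does not give a concrete formula here, so the variance-driven weighting below (trajectories with mixed outcomes get the most weight, while uniformly solved or uniformly failed groups are down-weighted) is an assumption about one plausible instantiation, not the authors' exact method:

```python
import numpy as np

def sampling_weights(step_rewards, eps=1e-8):
    """Hypothetical reward-aware sampling weights.

    Each trajectory group is weighted by the spread of its per-step
    rewards: high variance means the policy sometimes succeeds and
    sometimes fails, which is where the learning signal is richest.
    Near-zero variance (mastered or hopeless samples) shrinks the
    weight toward zero.
    """
    weights = []
    for rewards in step_rewards:
        r = np.asarray(rewards, dtype=float)
        weights.append(r.std())  # variance-driven term: peaks for mixed outcomes
    w = np.asarray(weights)
    total = w.sum()
    if total < eps:
        # All groups uninformative: fall back to uniform sampling.
        return np.full(len(w), 1.0 / len(w))
    return w / total  # normalize into a sampling distribution

# Example: a fully solved group, a mixed group, and a fully failed group.
probs = sampling_weights([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0], [0.0, 0.0]])
```

Under this sketch all of the probability mass lands on the mixed group, which matches the intuition that simple (always-solved) samples lose pedagogical value as training progresses.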
📝 Abstract
While reinforcement learning (RL) is increasingly used for LLM-based tool learning, its efficiency is often hampered by an overabundance of simple samples that provide diminishing learning value as training progresses. Existing dynamic sampling techniques are ill-suited for the multi-task structure and fine-grained reward mechanisms inherent to tool learning. This paper introduces Dynamic Sampling with Curriculum Learning (DSCL), a framework specifically designed to address this challenge by targeting the unique characteristics of tool learning: its multiple interdependent sub-tasks and multi-valued reward functions. DSCL features two core components: Reward-Based Dynamic Sampling, which uses multi-dimensional reward statistics (mean and variance) to prioritize valuable data, and Task-Based Dynamic Curriculum Learning, which adaptively focuses training on less-mastered sub-tasks. Through extensive experiments, we demonstrate that DSCL significantly improves training efficiency and model performance over strong baselines, achieving a 3.29% improvement on the BFCLv3 benchmark. Our method provides a tailored solution that effectively leverages the complex reward signals and sub-task dynamics within tool learning to achieve superior results.
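The Task-Based Dynamic Curriculum Learning component described above can be sketched as follows. The mastery estimator and the `(1 - mastery)` sampling rule are illustrative assumptions (the abstract only states that training adaptively focuses on less-mastered sub-tasks), as are the class and parameter names:

```python
import random

class TaskCurriculum:
    """Illustrative curriculum that focuses on less-mastered sub-tasks.

    Mastery per sub-task is tracked as an exponential moving average of
    recent successes; a sub-task's sampling weight is proportional to
    (1 - mastery), so under-performing sub-tasks get more training data.
    """

    def __init__(self, tasks, alpha=0.1, floor=0.05):
        self.mastery = {t: 0.0 for t in tasks}  # start: nothing mastered
        self.alpha = alpha  # EMA smoothing factor (assumed value)
        self.floor = floor  # minimum weight so no sub-task is starved

    def update(self, task, success):
        """Fold one rollout outcome into the task's mastery estimate."""
        m = self.mastery[task]
        self.mastery[task] = (1 - self.alpha) * m + self.alpha * float(success)

    def sample(self, rng=random):
        """Draw the next sub-task, biased toward low-mastery tasks."""
        tasks = list(self.mastery)
        weights = [max(1.0 - self.mastery[t], self.floor) for t in tasks]
        return rng.choices(tasks, weights=weights, k=1)[0]

# Example: repeated success on one sub-task shifts focus to the other.
cur = TaskCurriculum(["single_call", "multi_turn"])
for _ in range(50):
    cur.update("single_call", True)
```

After these updates, `single_call` is nearly mastered while `multi_turn` keeps its full weight, so subsequent `sample()` calls are heavily biased toward `multi_turn`. The `floor` keeps mastered sub-tasks in rotation to guard against forgetting.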