CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

📅 2025-10-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing curriculum learning methods for LLM reasoning training overlook the dynamic difficulty evolution of prompts and rely on simplistic filtering, leading to substantial computational waste. This work is the first to systematically model curriculum learning from a reinforcement learning gradient optimization perspective. We establish two key theoretical insights: (1) the prompt sampling distribution directly governs convergence rate, and (2) rollout resource allocation critically determines gradient stability. Building upon these, we propose a Bayesian posterior-based dynamic scheduling mechanism that jointly optimizes prompt selection and rollout allocation. Evaluated on 1.5B and 7B language models, our method achieves absolute improvements of +3.30 and +4.82 points over GRPO on reasoning benchmarks, respectively, while significantly accelerating convergence and reducing computational overhead.
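The posterior-based scheduling idea can be sketched concretely. The snippet below is an illustrative Python sketch, not the paper's implementation (`PromptStat`, `informativeness`, and `select_prompts` are hypothetical names): it keeps a Beta posterior over each prompt's solve rate and favors prompts near 50% accuracy, where GRPO-style group advantages carry the most gradient signal.

```python
from dataclasses import dataclass


@dataclass
class PromptStat:
    """Beta(alpha, beta) posterior over one prompt's solve rate."""
    alpha: float = 1.0  # prior pseudo-successes
    beta: float = 1.0   # prior pseudo-failures

    def update(self, successes: int, rollouts: int) -> None:
        # Conjugate Beta-Bernoulli update from a batch of rollouts.
        self.alpha += successes
        self.beta += rollouts - successes

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def informativeness(p: float) -> float:
    # Bernoulli reward variance: maximal at p = 0.5 and zero for
    # always-solved or never-solved prompts, whose identical rewards
    # yield zero group advantage (hence zero gradient) under GRPO.
    return p * (1.0 - p)


def select_prompts(stats: dict[str, PromptStat], k: int) -> list[str]:
    """Pick the k prompts whose estimated solve rate is most informative."""
    ranked = sorted(stats, key=lambda pid: informativeness(stats[pid].mean()),
                    reverse=True)
    return ranked[:k]
```

With this scheme, a prompt solved 4/8 times ranks above one solved 8/8 or 0/8 times, and the posterior is refreshed from rollouts the trainer already generates, consistent with the summary's claim of reduced computational overhead.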

๐Ÿ“ Abstract
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by **+3.30** points and **+4.82** points with 1.5B and 7B models, respectively. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.
Problem

Research questions and friction points this paper is trying to address.

Optimizing curriculum learning for efficient reasoning LLM training
Improving prompt selection and rollout allocation via gradient analysis
Reducing computational waste in LLM training through Bayesian methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses gradient analysis to optimize prompt selection
Employs Bayesian posterior estimation for rollout allocation
Accelerates convergence while minimizing computational overhead
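The second lever, rollout allocation, can be sketched with a simple variance-proportional heuristic (an illustrative stand-in, not the paper's exact allocation rule): prompts whose reward signal is noisiest receive more rollouts, so each prompt's gradient contribution is comparably stable.

```python
def allocate_rollouts(solve_rates: dict[str, float], total: int,
                      min_per_prompt: int = 1) -> dict[str, int]:
    """Split a rollout budget across prompts in proportion to the
    Bernoulli reward variance p * (1 - p), so noisy mid-difficulty
    prompts get more samples than near-deterministic ones."""
    weights = {pid: max(p * (1.0 - p), 1e-6) for pid, p in solve_rates.items()}
    z = sum(weights.values())
    return {pid: max(min_per_prompt, round(total * w / z))
            for pid, w in weights.items()}
```

For example, a prompt with an estimated 50% solve rate gets a larger share of the budget than one solved 90% of the time. Rounding can make the allocations drift slightly from `total`; a production scheduler would redistribute the remainder.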