🤖 AI Summary
Existing policy-based reinforcement learning (RL) methods—such as GRPO and DAPO—ignore sample difficulty heterogeneity when enhancing large language models’ (LLMs’) mathematical reasoning capabilities, contradicting the human cognitive principle of “progressing from easy to hard.”
Method: We propose VCRL, a curriculum learning framework that dynamically assesses sample difficulty via rollout reward variance. This variance quantifies model uncertainty per sample, enabling an adaptive difficulty-scheduling mechanism tightly integrated with policy gradient optimization. Crucially, reward variance serves as an annotation-free proxy for difficulty, enabling fully automated, dynamic curriculum generation.
Results: Evaluated across five mathematical reasoning benchmarks and two major LLM families, the method consistently outperforms strong RL baselines, including GRPO and DAPO, in both training efficiency and final performance. To our knowledge, this is the first work to leverage reward variance as a difficulty signal for curriculum learning in LLM alignment.
📝 Abstract
Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) do not explicitly account for LLMs' varying ability to learn from samples of different difficulty levels, which runs contrary to the human cognitive practice of tackling mathematical reasoning from easy to hard. Intuitively, we find that in RLVR the variance of a rollout group's rewards partly reflects how difficult the current sample is for the LLM: samples that are too easy or too hard yield low variance, while moderately difficult samples yield high variance. Based on this observation, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples according to the variance of group rewards. Experiments on five mathematical benchmarks with two models demonstrate the advantages of VCRL over current LLM RL baselines.
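The variance signal described above can be sketched in a few lines. With binary verifiable rewards (success rate p per prompt), the group reward variance is p(1 − p), which is 0 for always-solved or never-solved prompts and peaks at p = 0.5. A minimal illustration of scoring rollout groups this way and keeping the highest-variance (moderate-difficulty) samples — function names here are illustrative assumptions, not the paper's actual API:

```python
import numpy as np

def group_reward_variance(rewards):
    """Variance of one rollout group's rewards. For binary verifiable
    rewards with success rate p, this equals p * (1 - p)."""
    return float(np.asarray(rewards, dtype=float).var())

def select_moderate_samples(reward_groups, top_k):
    """Return indices of the top_k samples with the highest rollout-reward
    variance, i.e. those of moderate difficulty for the current policy.
    (Illustrative selection rule, not VCRL's exact scheduling.)"""
    scored = sorted(
        enumerate(reward_groups),
        key=lambda item: group_reward_variance(item[1]),
        reverse=True,
    )
    return [idx for idx, _ in scored[:top_k]]

# Binary rewards from 8 rollouts per prompt:
easy   = [1, 1, 1, 1, 1, 1, 1, 1]  # always solved -> variance 0.0
hard   = [0, 0, 0, 0, 0, 0, 0, 0]  # never solved  -> variance 0.0
medium = [1, 0, 1, 1, 0, 0, 1, 0]  # solved half   -> variance 0.25

print(select_moderate_samples([easy, hard, medium], top_k=1))  # [2]
```

The same idea extends to non-binary reward schemes; any per-group dispersion measure of the verifier's scores can serve as the difficulty proxy.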