VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing policy-based reinforcement learning (RL) methods such as GRPO and DAPO ignore heterogeneity in sample difficulty when improving large language models' (LLMs') mathematical reasoning, contradicting the human cognitive principle of progressing from easy to hard. Method: The paper proposes VCRL, a curriculum learning framework that dynamically assesses sample difficulty via the variance of rollout-group rewards. This variance quantifies the model's uncertainty on each sample, enabling an adaptive difficulty-scheduling mechanism integrated with policy-gradient optimization. Because it requires no human annotation, reward variance serves as a fully automatic, dynamic proxy for difficulty. Results: Across five mathematical reasoning benchmarks and two major LLM families, VCRL consistently outperforms RL baselines, including GRPO and DAPO, in both training efficiency and final performance. The authors position this as the first work to use reward variance as a difficulty signal for curriculum learning in LLM alignment.

📝 Abstract
Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly account for LLMs' ability to learn from samples of different difficulty levels, which is contrary to the human cognitive process of tackling mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group's rewards in RLVR partly reflects the difficulty of the current sample for the LLM: samples that are too easy or too difficult have lower variance, while samples of moderate difficulty have higher variance. Based on this observation, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models demonstrate the advantages of VCRL over current LLM RL baselines.
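The intuition above can be sketched in a few lines. This is not the paper's implementation; the threshold value and function names are illustrative. For binary correctness rewards (as in RLVR), the group variance is p(1 - p), where p is the fraction of correct rollouts: it is 0 when all rollouts agree (the sample is too easy or too hard for the current policy) and peaks at p = 0.5.

```python
def group_reward_variance(rewards):
    """Population variance of one sample's rollout-group rewards."""
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)

def select_moderate_samples(batch, threshold=0.1):
    """Keep samples whose rollout-reward variance exceeds a threshold,
    i.e. samples of moderate difficulty for the current policy.
    `batch` maps sample ids to lists of per-rollout rewards.
    The threshold value here is an illustrative choice, not the paper's."""
    return [sid for sid, rewards in batch.items()
            if group_reward_variance(rewards) > threshold]

# Example with binary rewards from 8 rollouts per prompt:
batch = {
    "easy":   [1, 1, 1, 1, 1, 1, 1, 1],  # variance 0.0  -> skipped
    "hard":   [0, 0, 0, 0, 0, 0, 0, 0],  # variance 0.0  -> skipped
    "medium": [1, 0, 1, 1, 0, 0, 1, 0],  # variance 0.25 -> kept
}
print(select_moderate_samples(batch))  # -> ['medium']
```

In a training loop, this selection would run after each rollout phase, so the "curriculum" adapts automatically as the policy improves and previously hard samples drift into the moderate-variance band.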
Problem

Research questions and friction points this paper is trying to address.

Existing RL methods ignore difficulty progression in mathematical reasoning
Current approaches fail to mimic human learning from easy to hard
Lack of dynamic difficulty adjustment based on model learning capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum RL using reward variance for difficulty
Dynamically controls sample difficulty during training
Variance-based grouping to optimize learning progression