Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

📅 2025-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

For small language models (1.5B–3B parameters), standard reinforcement learning (RL) yields only marginal gains on mathematical and code reasoning tasks. To address this, we propose E2H Reasoner, an RL training framework that applies an easy-to-hard (E2H) curriculum to LLM reasoning. The method combines task difficulty grading, dynamic scheduling, and conditional task decomposition. We establish convergence guarantees and sample-efficiency advantages within an approximate policy iteration framework, and show that gradually fading out easy tasks is critical for mitigating overfitting. Experiments show that E2H Reasoner significantly outperforms existing RL baselines on multiple reasoning benchmarks, including GSM8K, MATH, and HumanEval, substantially improving the accuracy of small models.

📝 Abstract

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method.
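To make the scheduling idea concrete, the sketch below shows one way an easy-to-hard sampler with fading easy tasks could look. It is an illustration under stated assumptions, not the schedule used by E2H Reasoner: the function names `difficulty_weights` and `sample_level`, the Gaussian-shaped weighting, and the `width` parameter are all hypothetical choices made for this example.

```python
import math
import random


def difficulty_weights(progress, num_levels, width=0.75):
    """Sampling probabilities over difficulty levels 0 (easiest) .. num_levels - 1 (hardest).

    Assumed schedule (hypothetical, not taken from the paper): a Gaussian-shaped
    bump whose centre slides from the easiest level at progress=0 to the hardest
    level at progress=1, so easy tasks dominate early and fade out later.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    center = progress * (num_levels - 1)
    scores = [
        math.exp(-((level - center) ** 2) / (2 * width ** 2))
        for level in range(num_levels)
    ]
    total = sum(scores)
    return [score / total for score in scores]


def sample_level(progress, num_levels, rng, width=0.75):
    """Draw one difficulty level for the next RL training task."""
    weights = difficulty_weights(progress, num_levels, width)
    return rng.choices(range(num_levels), weights=weights, k=1)[0]
```

With three difficulty levels, calling `difficulty_weights(0.0, 3)` puts most of the probability mass on the easiest level, while `difficulty_weights(1.0, 3)` puts most of it on the hardest and leaves the easiest level with only a small residual share. A smaller `width` would fade easy tasks out more sharply; a larger one keeps them in the mix for longer.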

Problem

Research questions and friction points this paper is trying to address.

Improving reasoning in LLMs via easy-to-hard RL training

Preventing overfitting by fading out easy tasks gradually

Reducing sample complexity with curriculum-based task decomposition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum learning schedules tasks from easy to hard

Fading out easy tasks prevents overfitting effectively

Theoretical convergence guarantees with fewer total samples
