Self-Evolving Curriculum for LLM Reasoning

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing curriculum design methods for reinforcement learning (RL) fine-tuning of large language models (LLMs) suffer from high randomness, heavy reliance on human heuristics, or unstable online filtering, leading to inefficient training. Method: The proposed Self-Evolving Curriculum (SEC) framework formulates curriculum selection as a non-stationary multi-armed bandit problem, learning the curriculum policy jointly with the RL policy. SEC uses the absolute advantage from policy gradients as a surrogate signal for immediate learning gain and dynamically adjusts the training sequence via TD(0) updates over problem categories defined by difficulty and task type. Contribution/Results: Experiments demonstrate significant performance gains across three reasoning domains (planning, inductive reasoning, and mathematics), along with improved generalization to harder, out-of-distribution problems and better skill balance when fine-tuning on multiple tasks simultaneously.

📝 Abstract
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Optimizing training curriculum order for LLM reasoning enhancement
Automating curriculum learning to replace heuristic manual designs
Improving generalization to harder out-of-distribution reasoning problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Evolving Curriculum for automatic training
Non-stationary Multi-Armed Bandit problem formulation
TD(0) method updates curriculum policy dynamically
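The core loop described above can be sketched as a small non-stationary bandit: each problem category is an arm, the reward is the absolute advantage observed during the RL step, and arm values are updated with TD(0). The class below is a minimal illustration, not the authors' implementation; the hyperparameters (`alpha`, `tau`), the softmax arm selection, and the mocked reward function are all assumptions for the sake of a runnable example.

```python
import math
import random

class CurriculumBandit:
    """Minimal sketch of SEC-style curriculum selection:
    a non-stationary multi-armed bandit over problem categories,
    with TD(0) value updates. Hyperparameters are illustrative."""

    def __init__(self, categories, alpha=0.1, tau=1.0):
        self.q = {c: 0.0 for c in categories}  # estimated learning gain per category
        self.alpha = alpha  # TD(0) learning rate
        self.tau = tau      # softmax temperature for arm selection

    def select(self):
        # Boltzmann (softmax) sampling over current value estimates.
        cats = list(self.q)
        weights = [math.exp(self.q[c] / self.tau) for c in cats]
        return random.choices(cats, weights=weights, k=1)[0]

    def update(self, category, reward):
        # TD(0) update: move the estimate toward the observed reward,
        # which the paper takes to be the absolute advantage from the
        # policy-gradient step on problems from this category.
        self.q[category] += self.alpha * (reward - self.q[category])


def mock_abs_advantage(category):
    """Placeholder for the real RL signal: pretend 'medium' problems
    currently yield the largest absolute advantage."""
    base = {"easy": 0.1, "medium": 0.8, "hard": 0.3}[category]
    return base + random.uniform(-0.05, 0.05)


if __name__ == "__main__":
    bandit = CurriculumBandit(["easy", "medium", "hard"])
    for _ in range(500):
        cat = bandit.select()
        bandit.update(cat, mock_abs_advantage(cat))
    # The bandit's value estimates should now favor 'medium'.
    print(max(bandit.q, key=bandit.q.get))
```

Because the bandit tracks a moving estimate rather than a long-run average, it adapts as the RL policy improves and a category's learning gain decays, which is what makes the non-stationary formulation appropriate here.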
Xiaoyin Chen
Mila – Quebec AI Institute, Université de Montréal
Jiarui Lu
Mila – Quebec AI Institute, Université de Montréal
Minsu Kim
Mila – Quebec AI Institute, KAIST
Dinghuai Zhang
Mila – Quebec AI Institute, Microsoft Research
Jian Tang
Mila – Quebec AI Institute, HEC Montréal
Alexandre Piché
ServiceNow Research
Nicolas Gontier
ServiceNow Research
Dialog Systems, Reasoning, Deep Learning, Natural Language Processing
Yoshua Bengio
Professor of computer science, University of Montreal, Mila, IVADO, CIFAR
Machine Learning, Deep Learning, Artificial Intelligence
Ehsan Kamalloo
Research Scientist at ServiceNow AI Research
Natural Language Processing, Information Retrieval, Machine Learning