Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

๐Ÿ“… 2026-02-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the efficiency challenges posed by large-scale heterogeneous data in reinforcement learningโ€“based post-training of large language models. The authors propose the Actor-Curator framework, which uniquely integrates policy improvement objectives with curriculum learning. Central to this approach is a neural curator that formulates dynamic problem selection as a non-stationary stochastic bandit problem and derives an optimization loss via online stochastic mirror descent. This enables fully automated, scalable co-adaptive curriculum learning with regret guarantees under partial feedback. Experiments demonstrate that the method achieves relative performance improvements of 28.6% and 30.5% on AIME2024 and ARC-1D benchmarks, respectively, while accelerating training by up to 80%, significantly outperforming uniform sampling and strong existing baselines.

๐Ÿ“ Abstract
Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
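The selection mechanism described in the abstract can be illustrated with a small sketch: treat each problem (or problem cluster) as a bandit arm, sample proportionally to a learned distribution, reward the pulled arm by the observed policy improvement, and update with an importance-weighted exponentiated-gradient step (mirror descent over the simplex with entropic regularization, the standard instantiation of online stochastic mirror descent under partial feedback). This is not the authors' implementation; the class name, the uniform exploration mixing, the reward definition, and all hyperparameters are illustrative assumptions.

```python
import math
import random


class BanditCurator:
    """Toy curriculum selector: a non-stationary bandit over a problem bank,
    updated by exponentiated gradient (mirror descent on the simplex)."""

    def __init__(self, n_problems: int, lr: float = 0.1, explore: float = 0.05):
        self.n = n_problems
        self.lr = lr            # mirror-descent step size (assumed value)
        self.explore = explore  # uniform mixing, keeps probabilities bounded away from 0
        self.weights = [0.0] * n_problems  # log-weights over problems

    def probs(self) -> list[float]:
        # softmax of log-weights, mixed with uniform exploration
        m = max(self.weights)
        exp_w = [math.exp(w - m) for w in self.weights]  # subtract max for stability
        z = sum(exp_w)
        return [(1 - self.explore) * e / z + self.explore / self.n for e in exp_w]

    def select(self, rng: random.Random) -> int:
        # sample a training problem from the current distribution
        return rng.choices(range(self.n), weights=self.probs(), k=1)[0]

    def update(self, arm: int, reward: float) -> None:
        # partial feedback: only the pulled arm's reward (e.g. measured policy
        # improvement after training on it) is observed, so importance-weight it
        p = self.probs()
        self.weights[arm] += self.lr * reward / p[arm]


if __name__ == "__main__":
    rng = random.Random(0)
    curator = BanditCurator(n_problems=5)
    # pretend problem 1 is the only one currently yielding policy improvement
    for _ in range(200):
        arm = curator.select(rng)
        curator.update(arm, reward=1.0 if arm == 1 else 0.0)
    print(curator.probs())  # mass concentrates on problem 1
```

In the paper's setting the reward would be an estimate of expected policy performance improvement produced by the neural curator rather than a fixed per-arm signal, and the arm space is a large problem bank rather than five indices; the update rule above only sketches the mirror-descent mechanics.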
Problem

Research questions and friction points this paper is trying to address.

curriculum learning
reinforcement learning
large language models
post-training
problem selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

curriculum learning
reinforcement learning
bandit optimization
large language models
policy improvement
๐Ÿ”Ž Similar Papers
No similar papers found.