AI Summary
This work addresses the efficiency challenges posed by large-scale heterogeneous data in reinforcement learning-based post-training of large language models. The authors propose the Actor-Curator framework, which uniquely integrates policy improvement objectives with curriculum learning. Central to this approach is a neural curator that formulates dynamic problem selection as a non-stationary stochastic bandit problem and derives an optimization loss via online stochastic mirror descent. This enables fully automated, scalable co-adaptive curriculum learning with regret guarantees under partial feedback. Experiments demonstrate that the method achieves relative performance improvements of 28.6% and 30.5% on AIME2024 and ARC-1D benchmarks, respectively, while accelerating training by up to 80%, significantly outperforming uniform sampling and strong existing baselines.
Abstract
Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
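To make the bandit formulation concrete, the sketch below illustrates the general recipe of adversarial-bandit problem selection with an exponentiated-gradient update (online mirror descent under the KL divergence), which is the standard way such a selection loss is optimized under partial feedback. This is an illustrative toy, not the paper's implementation: the problem bank, the `true_gain` values standing in for expected policy improvement, and all hyperparameters are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem bank: each "problem" has a latent expected policy
# improvement when the actor trains on it. These numbers are illustrative.
true_gain = np.array([0.1, 0.5, 0.9, 0.3])
n_arms = len(true_gain)

eta = 0.1     # mirror-descent step size
gamma = 0.05  # uniform-exploration mixing weight

log_w = np.zeros(n_arms)  # log-weights the curator maintains over problems

for t in range(2000):
    # Current sampling distribution: softmax of log-weights, mixed with
    # a uniform floor so every problem keeps nonzero probability.
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    p = (1 - gamma) * p + gamma / n_arms

    # Partial (bandit) feedback: a noisy improvement signal is observed
    # only for the problem actually selected this round.
    arm = rng.choice(n_arms, p=p)
    r = true_gain[arm] + 0.1 * rng.standard_normal()

    # Importance-weighted reward estimate (unbiased under partial feedback),
    # followed by the exponentiated-gradient / mirror-descent update.
    r_hat = np.zeros(n_arms)
    r_hat[arm] = r / p[arm]
    log_w += eta * r_hat

# After training, the curator's distribution concentrates on the problem
# with the highest expected improvement.
p_final = np.exp(log_w - log_w.max())
p_final /= p_final.sum()
```

Because the importance-weighted estimate is unbiased, the expected log-weight of each problem grows in proportion to its true gain, so the curator progressively focuses sampling on the most useful problems while the exploration floor bounds the estimator's variance.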