🤖 AI Summary
Humans excel at “learning on the job”: improving at a task while carrying it out. This paper introduces Test-Time Curriculum Reinforcement Learning (TTC-RL), a framework in which a model automatically assembles a task-specific curriculum at test time, selects the most task-relevant samples from a large pool of available training data, and continues training on them to improve performance on the target task. Its core contribution is to extend test-time scaling into goal-directed continual training on thousands of task-relevant experiences, without human curation of datasets. TTC-RL integrates automatic curriculum selection with sparse-reward-driven policy optimization. Evaluated on mathematical reasoning (AIME25) and competitive programming (CodeElo) benchmarks, it improves the pass@1 of Qwen3-8B by approximately 1.8× and 2.1×, respectively, and raises pass@8 from 40% to 62% on AIME25 and from 28% to 43% on CodeElo.
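The summary does not say how task relevance is measured. As a rough illustration only, the sketch below assumes each problem is already embedded as a vector and ranks pool items by cosine similarity to the centroid of the target-task embeddings. The function names, the centroid heuristic, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_curriculum(target_vecs, pool_vecs, k):
    """Return indices of the k pool items most similar to the target-task centroid.

    Assumption: relevance = cosine similarity to the mean target embedding.
    This is an illustrative stand-in, not the selection rule used by TTC-RL.
    """
    dim = len(target_vecs[0])
    centroid = [sum(v[i] for v in target_vecs) / len(target_vecs) for i in range(dim)]
    ranked = sorted(
        range(len(pool_vecs)),
        key=lambda i: cosine(centroid, pool_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy usage: one target embedding and a pool of four candidates.
target = [[1.0, 0.0]]
pool = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
chosen = select_curriculum(target, pool, k=2)
```

In a full pipeline the selected items would then serve as the training problems for the reinforcement learning stage; that stage is outside the scope of this sketch.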
📝 Abstract
Humans are good at learning on the job: we learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called a test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences at test time.
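For readers unfamiliar with the pass@1 and pass@8 metrics quoted above, the following sketch shows the widely used unbiased pass@k estimator: given n sampled solutions to a problem, of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. The abstract does not describe the exact evaluation protocol, so treating this estimator as the one used here is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number of correct ones, k: budget (k <= n).
    Computes 1 - C(n - c, k) / C(n, k), the chance that a random size-k subset
    of the n samples contains at least one correct solution.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy usage: 10 samples, 3 correct.
p1 = pass_at_k(10, 3, 1)
p8 = pass_at_k(10, 3, 8)
```

A benchmark-level score is the mean of this quantity over all problems. Pass@8 is never lower than pass@1, which is why the abstract treats it as a performance ceiling.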