Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning

📅 2025-10-06

🤖 AI Summary

Humans excel at learning on the job, improving at a task while carrying it out. This paper introduces Test-Time Curriculum Reinforcement Learning (TTC-RL), in which the model assembles a task-specific curriculum at test time by automatically selecting the most task-relevant examples from a large pool of available training data, then continues training on them with reinforcement learning to improve on its target task. No human curation of the dataset is required. The core contribution is to extend test-time scaling into goal-directed continual training on thousands of task-relevant experiences. TTC-RL integrates automatic curriculum selection with sparse-reward-driven policy optimization. Evaluated on mathematical reasoning (AIME25) and competitive programming (CodeElo) benchmarks, it improves the pass@1 of Qwen3-8B by approximately 1.8× and 2.1×, respectively, and raises pass@8 from 40% to 62% on AIME25 and from 28% to 43% on CodeElo.

📝 Abstract

Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences during test-time.

Problem

Research questions and friction points this paper is trying to address.

How to create task-specific curricula for reinforcement learning automatically

How to select task-relevant training data without time-consuming human curation

How to improve model performance on challenging math and coding benchmarks
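
The pass@1 and pass@8 figures quoted in the abstract are pass@k scores. This summary does not say how the paper computes them, so the following is only a sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts per problem and c is the number that are correct. The function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: attempts sampled for one problem, c: attempts that were correct.
    Assumption: this is the standard unbiased estimator, not necessarily
    the exact evaluation procedure used in the paper.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With this definition, pass@1 reflects the average quality of a single attempt, while pass@8 reflects whether any of eight attempts succeeds, which is why the abstract describes the pass@8 gain as raising the performance ceiling.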

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent creates task-specific curriculum automatically

Reinforcement learning applied during test-time training

Selects relevant data from large pool automatically
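
The summary does not describe how task-relevant data is chosen from the pool. As a rough illustration of the general idea only, the sketch below ranks pool examples by cosine similarity to the target tasks and keeps the top few. The embedding vectors, the max-similarity scoring rule, and the names `cosine` and `select_curriculum` are all assumptions made for this example and should not be read as the paper's selection method.

```python
from math import sqrt


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_curriculum(
    target_vecs: list[list[float]],
    pool_vecs: list[list[float]],
    size: int,
) -> list[int]:
    """Return indices of the `size` pool examples most similar to any target.

    Hypothetical stand-in for curriculum selection: each pool example is
    scored by its best cosine similarity to a target-task embedding.
    """
    scored = []
    for idx, pool_vec in enumerate(pool_vecs):
        score = max(cosine(pool_vec, t) for t in target_vecs)
        scored.append((score, idx))
    # Highest score first; ties broken by lower pool index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:size]]


# Toy usage: one target direction, four pool examples.
targets = [[1.0, 0.0]]
pool = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.0], [-1.0, 0.0]]
chosen = select_curriculum(targets, pool, size=2)
```

In a pipeline of the kind the abstract describes, the selected examples would then serve as the training set for continued reinforcement learning on the target task.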
