Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

📅 2026-05-11

🤖 AI Summary
This work addresses a key limitation of existing curriculum learning approaches for reinforcement fine-tuning of large language models: they rely on external heuristics or auxiliary models to assess difficulty, and so can drift out of alignment with the evolving policy. The authors propose METIS, a framework that internalizes curriculum decisions as a metacognitive capability of the model itself. METIS quantifies how informative a prompt is by its within-prompt reward variance, has the policy predict this quantity from recent training outcomes, and uses the prediction to schedule training samples. The framework jointly optimizes the task reward and a self-judgment reward, which yields a lightweight, closed-loop curriculum. Evaluated on discrete and continuous reinforcement fine-tuning benchmarks covering mathematical reasoning, code generation, and agentic function calling, METIS improves performance and accelerates convergence by up to 67%.
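To make the informativeness signal concrete, here is a minimal Python sketch of within-prompt reward variance. It is an illustration under stated assumptions, not the authors' implementation: the function name `reward_variance`, the prompt names, and the rollout rewards are all hypothetical, and the use of population variance is a choice made here. The point it shows is that a prompt the policy always solves or always fails has zero variance, while a borderline prompt has the highest variance.

```python
from statistics import pvariance


def reward_variance(rewards):
    """Population variance of the rewards from several rollouts of one prompt.

    Assumption: population (not sample) variance; the source does not say which.
    """
    if not rewards:
        raise ValueError("need at least one rollout reward")
    return pvariance(rewards)


# Hypothetical rollout rewards (1.0 = correct, 0.0 = incorrect) for three prompts.
rollouts = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],    # always solved: no learning signal
    "too_hard": [0.0, 0.0, 0.0, 0.0],    # never solved: no learning signal
    "borderline": [1.0, 0.0, 1.0, 0.0],  # solved half the time
}

scores = {name: reward_variance(r) for name, r in rollouts.items()}
most_informative = max(scores, key=scores.get)
print(scores, most_informative)
```

For binary rewards with success rate p the variance is p(1 - p), which peaks at p = 0.5, so ranking by variance favors prompts at the edge of the policy's current ability.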
📝 Abstract
In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging the critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric using recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks spanning mathematical reasoning, code generation, and agentic function calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.
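The closed loop described in the abstract can be sketched roughly as follows. Everything specific here is an assumption made for illustration, since the abstract does not give these details: the top-k selection rule in `select_prompts`, the error-based form of `self_judgment_reward`, the normalizer `max_variance=0.25` (the maximum variance of a binary reward), and the mixing `weight` in `joint_reward` are all hypothetical.

```python
def select_prompts(predicted_variance, budget):
    """Pick the `budget` prompts with the highest predicted reward variance.

    Assumption: simple top-k allocation; the source does not specify the rule.
    """
    ranked = sorted(predicted_variance, key=predicted_variance.get, reverse=True)
    return ranked[:budget]


def self_judgment_reward(predicted, observed, max_variance=0.25):
    """Hypothetical reward in [0, 1]: 1.0 when the predicted variance matches
    the variance actually observed in the rollouts, falling linearly with error."""
    error = abs(predicted - observed) / max_variance
    return max(0.0, 1.0 - error)


def joint_reward(task_reward, judgment_reward, weight=0.1):
    """Assumed additive combination of the task reward and the self-judgment
    reward; `weight` is an illustrative hyperparameter."""
    return task_reward + weight * judgment_reward


# Hypothetical variance predictions made by the policy for three prompts.
predicted = {"a": 0.05, "b": 0.24, "c": 0.20}
chosen = select_prompts(predicted, budget=2)

# After rollouts, score the prediction for prompt "b" against what was observed.
judgment = self_judgment_reward(predicted=0.25, observed=0.25)
total = joint_reward(task_reward=1.0, judgment_reward=judgment, weight=0.5)
print(chosen, judgment, total)
```

Under these assumptions, the policy is rewarded both for solving the task and for predicting accurately which prompts are informative, which is one plausible reading of "jointly optimizing the standard RFT rewards and a self-judgment reward".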
Problem

Research questions and friction points this paper is trying to address.

curriculum learning
reinforcement fine-tuning
LLM
misalignment
heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

curriculum learning
reinforcement fine-tuning
metacognition
self-judgment
in-context learning