AI Summary
Conventional reinforcement learning (RL) for training reasoning language models (LMs) relies heavily on the model's initial task-solving capability, suffers from inefficient exploration due to sparse rewards, and typically produces models that serve as teachers for knowledge distillation rather than being deployed directly.
Method: We propose a distillation-first Reinforcement-Learned Teacher (RLT) paradigm that replaces sparse-reward policy optimization with dense, prompt-driven reward signals and interpretable reasoning-path generation. It introduces a closed-loop evaluation based on student feedback and a test-time-scaling distillation mechanism. Crucially, it eliminates dependence on the LM's initial reasoning competence and enables zero-shot generalization to out-of-distribution tasks.
Contribution/Results: Our 7B-scale RLT achieves state-of-the-art performance on competition-level and graduate-level reasoning benchmarks. It surpasses baseline models with over 100× more parameters in both distillation efficacy and cold-start reasoning, significantly enhancing the reusability and scalability of RL-based reasoning frameworks.
Abstract
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
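To make the dense teacher reward concrete, below is a minimal sketch of one way to score an RLT explanation: the student's average log-probability of the ground-truth solution when conditioned on the question plus the teacher's explanation. The student checkpoint, prompt format, and choice of mean log-likelihood as the reward are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Sketch: dense reward for a teacher explanation, measured as how well a frozen
# student model predicts the ground-truth solution after reading the explanation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2.5-7B-Instruct"  # assumed student checkpoint (placeholder)
tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16)
student.eval()

@torch.no_grad()
def explanation_reward(question: str, explanation: str, solution: str) -> float:
    """Mean log-probability the student assigns to the solution tokens,
    conditioned on the question and the teacher's explanation."""
    context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
    ctx_ids = tok(context, return_tensors="pt").input_ids
    sol_ids = tok(solution, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, sol_ids], dim=1)

    logits = student(input_ids).logits
    # Predictive logits for each solution token come from the preceding position.
    sol_logits = logits[:, ctx_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(sol_logits, dim=-1)
    token_lp = log_probs.gather(-1, sol_ids.unsqueeze(-1)).squeeze(-1)
    # Higher reward means the explanation made the solution easier to predict.
    return token_lp.mean().item()
```

In an RL loop, this score would be computed for every sampled explanation and used as a dense per-sample reward for the teacher, avoiding the sparse pass/fail signal of correctness-only RL.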