AI Summary
Conventional reinforcement learning (RL) for training reasoning language models (LMs) relies heavily on the model's initial task-solving capability, suffers from inefficient exploration due to sparse rewards, and typically produces models that serve as teachers for knowledge distillation rather than being deployed directly.
Method: We propose a distillation-first Reinforcement-Learned Teacher (RLT) paradigm that replaces sparse-reward policy optimization with dense, prompt-driven reward signals and interpretable reasoning-path generation. It introduces a closed-loop evaluation based on student feedback and a test-time-scaling distillation mechanism. Crucially, it eliminates dependence on the LM's initial reasoning competence and enables zero-shot generalization to out-of-distribution tasks.
Contribution/Results: Our 7B-scale RLT achieves state-of-the-art performance on competition-level and graduate-level reasoning benchmarks. It surpasses baseline models with over 100× more parameters in both distillation efficacy and cold-start reasoning, significantly enhancing the reusability and scalability of RL-based reasoning frameworks.
Abstract
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
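To make the dense teacher reward concrete, below is a minimal sketch of one way to score an RLT explanation: the student's average log-probability of the ground-truth solution when conditioned on the question plus the teacher's explanation. The student checkpoint, prompt format, and choice of mean log-likelihood as the reward are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Sketch: dense reward for a teacher explanation, measured as how well a frozen
# student model predicts the ground-truth solution after reading the explanation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2.5-7B-Instruct"  # assumed student checkpoint (placeholder)
tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16)
student.eval()

@torch.no_grad()
def explanation_reward(question: str, explanation: str, solution: str) -> float:
    """Mean log-probability the student assigns to the solution tokens,
    conditioned on the question and the teacher's explanation."""
    context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
    ctx_ids = tok(context, return_tensors="pt").input_ids
    sol_ids = tok(solution, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, sol_ids], dim=1)

    logits = student(input_ids).logits
    # Predictive logits for each solution token come from the preceding position.
    sol_logits = logits[:, ctx_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(sol_logits, dim=-1)
    token_lp = log_probs.gather(-1, sol_ids.unsqueeze(-1)).squeeze(-1)
    # Higher reward means the explanation made the solution easier to predict.
    return token_lp.mean().item()
```

In an RL loop, this score would be computed for every sampled explanation and used as a dense per-sample reward for the teacher, avoiding the sparse pass/fail signal of correctness-only RL.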