🤖 AI Summary
Standard on-policy distillation often suffers from training instability and negative transfer when transferring reasoning capabilities to smaller models. This work proposes REOPOLD, a framework that reframes distillation as a policy optimization process in which the teacher-student log-likelihood ratio serves as a token-level reward. By integrating mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy, REOPOLD enables relaxed on-policy distillation that mitigates the optimization difficulties caused by strict imitation. Experiments show that the method significantly outperforms baseline approaches across mathematical reasoning, visual reasoning, and agent-based tool use, achieving 6.7–12× higher sample efficiency. Notably, a 7B-parameter student trained with REOPOLD matches the performance of a 32B-parameter teacher while delivering a ~3.32× inference speedup.
📝 Abstract
On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization in which the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD moderately and selectively leverages rewards from the teacher through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. In particular, REOPOLD outperforms recent RL approaches with 6.7–12× greater sample efficiency, and enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32× inference speedup.
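The central idea — treating the teacher-student log-likelihood ratio as a per-token reward, then relaxing it with clipping — can be sketched minimally as follows. The function name, the clipping bound, and the exact clipping form here are illustrative assumptions, not the paper's specification (REOPOLD's actual mixture-based clipping and sampling scheme is more involved).

```python
import math

def token_rewards(teacher_logprobs, student_logprobs, clip=5.0):
    """Illustrative sketch: per-token reward as the teacher-student
    log-likelihood ratio, log p_teacher(y_t) - log p_student(y_t),
    clipped to [-clip, clip] to relax the strict-imitation signal.
    (Hypothetical helper; clip value chosen for illustration.)"""
    return [max(-clip, min(clip, t - s))
            for t, s in zip(teacher_logprobs, student_logprobs)]

# Tokens the student already matches get near-zero reward; tokens the
# teacher rates far higher get a positive (but bounded) reward.
rewards = token_rewards(
    teacher_logprobs=[-0.1, -0.5, -2.0],
    student_logprobs=[-0.1, -3.0, -0.2],
)
```

In this toy example the first token contributes no learning signal (the models agree), the second yields a positive reward pushing the student toward the teacher, and the third a negative one; clipping caps how large any single token's correction can be, which is one way to temper the strict-imitation pressure the paper identifies as a source of instability.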