AI Summary
This work addresses a practical obstacle to deploying small language models: they tend to generate verbose, computationally expensive intermediate reasoning trajectories. The authors propose Mixed-Policy Distillation (MPD), a framework in which a larger teacher model rewrites, online, the verbose trajectories sampled by a smaller student model into concise and effective reasoning paths. The student is then aligned with these compressed trajectories via a KL-divergence objective, which preserves its exploratory capability while teaching it to reason more efficiently. MPD combines the benefits of on-policy and off-policy distillation, avoiding both explicit length constraints and the distribution mismatch that arises when training on teacher-generated trajectories. Experiments on Qwen3-1.7B show that MPD reduces reasoning token usage by up to 27.1% while improving performance across multiple reasoning benchmarks.
Abstract
Reasoning-centric large language models (LLMs) achieve strong performance by generating intermediate reasoning trajectories, but often incur excessive token usage and high inference-time decoding cost. We observe that, when solving the same problems, larger reasoning models can often produce more concise traces, whereas smaller reasoning models tend to generate longer and more redundant trajectories. This is especially problematic in real-world deployment, where memory, latency, and serving-cost constraints often favor smaller models. Our observations suggest that reasoning compression can be transferred from large models to small ones rather than enforced through explicit length constraints. Based on this insight, we propose Mixed-Policy Distillation (MPD), a reasoning compression framework that transfers concise reasoning behavior from a larger teacher to a smaller student by distilling teacher-compressed student trajectories. Unlike on-policy distillation, which aligns the student with teacher distributions over verbose student trajectories, or off-policy distillation, which relies on teacher-generated trajectories and may suffer from distribution mismatch, MPD combines the strengths of both. Given a student-sampled trajectory, the teacher rewrites it into a more concise reasoning trace, and the student is trained via KL-based alignment on the compressed trajectory. This preserves student-policy exploration while injecting teacher-guided compression. Experiments on Qwen3-1.7B show that MPD reduces token usage by up to 27.1% while improving performance across multiple reasoning benchmarks, demonstrating an effective approach to efficient small-model reasoning.
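The three-stage loop described above (student samples, teacher compresses, student aligns via KL) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names (`mpd_step`, `trajectory_kl`), the toy stand-ins for both models, and the choice of forward KL(teacher || student) averaged over tokens are all assumptions made for illustration; the abstract only states that alignment is "KL-based". A real implementation would score each position conditioned on the prompt and preceding tokens with actual language models.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def trajectory_kl(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) along one trajectory.

    Assumption: forward KL averaged over positions; the source does not
    specify the direction or the reduction.
    """
    assert len(teacher_logits) == len(student_logits)
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

def mpd_step(prompt, sample_student, compress, teacher_logits_fn, student_logits_fn):
    """One hypothetical mixed-policy step; all four callables are placeholders."""
    verbose = sample_student(prompt)        # 1. student explores with its own policy
    concise = compress(prompt, verbose)     # 2. teacher rewrites the student trajectory
    t_logits = teacher_logits_fn(prompt, concise)
    s_logits = student_logits_fn(prompt, concise)
    return concise, trajectory_kl(t_logits, s_logits)  # 3. loss on the compressed trajectory

# Toy stand-ins (hypothetical): a 3-token vocabulary and rule-based "models".
VOCAB = ["a", "b", "c"]

def toy_sample(prompt):
    return ["a", "a", "b", "b", "c"]        # redundant student trajectory

def toy_compress(prompt, trajectory):
    out = []
    for tok in trajectory:                  # drop consecutive repeats
        if not out or out[-1] != tok:
            out.append(tok)
    return out

def toy_teacher_logits(prompt, trajectory):
    return [[4.0 if v == tok else 0.0 for v in VOCAB] for tok in trajectory]

def toy_student_logits(prompt, trajectory):
    return [[1.0 if v == tok else 0.0 for v in VOCAB] for tok in trajectory]

concise, loss = mpd_step("2+2=?", toy_sample, toy_compress,
                         toy_teacher_logits, toy_student_logits)
print(concise, round(loss, 4))
```

The point of the sketch is where the loss is computed: on a trajectory that originates from the student's own sample but has been shortened by the teacher, rather than on the raw student sample (on-policy) or on a trajectory generated from scratch by the teacher (off-policy).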