🤖 AI Summary
Large language models often generate redundant or even harmful intermediate reasoning text, which reduces efficiency and degrades accuracy. This work proposes OPSDC, a self-distillation method that obtains teacher logits by conditioning the model on a "concise" instruction and minimizes the token-wise reverse KL divergence against its original reasoning trajectory. Notably, OPSDC requires no external supervision, ground-truth answers, token budgets, or difficulty estimation. The approach automatically compresses reasoning substantially for simple problems while preserving necessary steps for complex ones. Evaluated on Qwen3-8B and Qwen3-14B, it achieves 57–59% token compression with a 9–16 percentage-point accuracy gain on MATH-500; on AIME 2024, the 14B model attains 41% compression alongside a 10-point accuracy improvement.
📝 Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.
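The core objective described above can be sketched in a few lines. Below is a minimal NumPy illustration of the token-wise reverse KL(student ‖ teacher) term, where both distributions come from the same model (the teacher simply sees an added "be concise" instruction). The function names and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher) along a rollout.

    Both arguments are [seq_len, vocab] logit arrays: the student's
    logits on its own rollout, and the same model's logits when
    conditioned on a concise-mode instruction (hypothetical setup).
    Returns a [seq_len] array; minimizing its mean is mode-seeking,
    pulling the student toward the teacher's high-probability tokens.
    """
    p_s = softmax(student_logits)
    log_p_s = np.log(p_s)
    log_p_t = np.log(softmax(teacher_logits))
    return (p_s * (log_p_s - log_p_t)).sum(axis=-1)
```

The reverse direction of the KL is the key design choice: unlike forward KL, it does not force the student to cover all of the teacher's probability mass, so the student can collapse onto the teacher's concise modes rather than averaging over verbose alternatives.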