🤖 AI Summary
Large language models often generate redundant or even harmful intermediate reasoning text, which reduces efficiency and degrades accuracy. This work proposes OPSDC, a self-distillation method that obtains teacher logits by conditioning the model on a "concise" instruction and minimizes the token-wise reverse KL divergence against its original reasoning trajectory. Notably, OPSDC requires no external supervision, ground-truth answers, token budgets, or difficulty estimation. The approach automatically compresses reasoning substantially for simple problems while preserving necessary steps for complex ones. Evaluated on Qwen3-8B and Qwen3-14B, it achieves 57–59% token compression with a 9–16 percentage-point accuracy gain on MATH-500; on AIME 2024, the 14B model attains 41% compression alongside a 10-point accuracy improvement.
📝 Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.
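The core objective described above can be sketched in a few lines. Below is a minimal NumPy illustration of the token-wise reverse KL(student ‖ teacher) term, where both distributions come from the same model (the teacher simply sees an added "be concise" instruction). The function names and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    """Per-token reverse KL(student || teacher) along a rollout.

    Both arguments are [seq_len, vocab] logit arrays: the student's
    logits on its own rollout, and the same model's logits when
    conditioned on a concise-mode instruction (hypothetical setup).
    Returns a [seq_len] array; minimizing its mean is mode-seeking,
    pulling the student toward the teacher's high-probability tokens.
    """
    p_s = softmax(student_logits)
    log_p_s = np.log(p_s)
    log_p_t = np.log(softmax(teacher_logits))
    return (p_s * (log_p_s - log_p_t)).sum(axis=-1)
```

The reverse direction of the KL is the key design choice: unlike forward KL, it does not force the student to cover all of the teacher's probability mass, so the student can collapse onto the teacher's concise modes rather than averaging over verbose alternatives.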